Repertoire Schema#

A Repertoire is an abstract organizational unit of analysis that is defined by the researcher and consists of study metadata, subject metadata, sample metadata, cell processing metadata, nucleic acid processing metadata, sequencing run metadata, a set of raw sequence files, data processing metadata, and a set of Rearrangements. A Repertoire gathers all of this information together into a composite object, which can be easily accessed by computer programs for data entry, analysis and visualization.

A Repertoire is specific to a single subject; otherwise, it can consist of any number of samples (which can be processed in different ways), any number of raw sequence files, and any number of Rearrangements. It can also consist of any number of data processing metadata objects that describe the processing of raw sequence files into Rearrangements.

Typically, a Repertoire corresponds to the biological concept of the immune repertoire for that single subject which the researcher experimentally measures and computationally analyzes. However, researchers can have different interpretations about what constitutes the biological immune repertoire; therefore, the Repertoire schema attempts to be flexible and broadly useful for all AIRR-seq studies.

Another researcher can take the same raw sequencing data and associated metadata and create their own Repertoire that is different from the original researcher’s. A common example is to define a repertoire that is a subset such as “productive rearrangements for IGHV4” whereas the original researcher defined a more generic “B cell repertoire”. This new Repertoire would have much of the same metadata as the original Repertoire, except associated with a different study, and with additional information in the data processing metadata that describes how the rearrangements were filtered down to just the “productive rearrangements for IGHV4”. Likewise, another researcher may get access to the original biosample material and perform their own sample processing and sequencing, which also would be a new Repertoire. That new Repertoire could combine samples from the original researcher’s Repertoire with the new sample data as a large dataset for the subject.

Multiple Data Processing on a Repertoire#

Data processing can be a complicated multi-stage process. Documenting the process in a formal way is challenging because of the diversity of actions that may be performed. The MiAIRR standard requires documentation of the process, but only in an informal way with free text descriptions. A Repertoire might undergo data processing multiple times for any number of reasons, e.g. to compare the results from different toolchains, or to compare different settings for the same toolchain.

It is expected that all of the Samples of a Repertoire will be processed together within a DataProcessing. That is, a DataProcessing that only uses some but not all samples in a Repertoire could be confusing to users and appear as though data is missing. Likewise, processing some samples within a Repertoire with one DataProcessing and the remaining samples with a different DataProcessing could also confuse users. Because DataProcessing is unstructured information, it is not possible to validate that all Samples in a Repertoire are being processed together, so this expectation cannot be strictly enforced.

Having multiple DataProcessing for a Repertoire will create multiple sets of Rearrangements that are distinct and separate from each other. Analysis tools need to be careful not to mix these sets of Rearrangements from different DataProcessing because it can generate incorrect results. The identifier data_processing_id was added so Rearrangements can identify their specific DataProcessing.
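One way to keep these sets apart is to partition rearrangement records by the pair (repertoire_id, data_processing_id) before any analysis. The following is a minimal sketch, assuming the rearrangements have already been loaded as plain dictionaries; the function name `group_by_processing` and the example identifiers are hypothetical and not part of the AIRR tooling.

```python
from collections import defaultdict


def group_by_processing(rearrangements):
    """Partition rearrangement records by (repertoire_id, data_processing_id)."""
    groups = defaultdict(list)
    for record in rearrangements:
        # data_processing_id may be absent when a Repertoire has one DataProcessing.
        key = (record["repertoire_id"], record.get("data_processing_id"))
        groups[key].append(record)
    return dict(groups)


# Hypothetical records: two toolchains were run on the same Repertoire.
rearrangements = [
    {"sequence_id": "s1", "repertoire_id": "rep-1", "data_processing_id": "dp-a"},
    {"sequence_id": "s2", "repertoire_id": "rep-1", "data_processing_id": "dp-b"},
    {"sequence_id": "s3", "repertoire_id": "rep-1", "data_processing_id": "dp-a"},
]
groups = group_by_processing(rearrangements)
```

Each value in `groups` can then be analyzed on its own, so results from different DataProcessing are never pooled by accident.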

Linking Data#

Each Repertoire has a unique repertoire_id identifier. This identifier should be globally unique so that repertoires from multiple studies can be combined together without conflict. The repertoire_id is used to link other AIRR data to a Repertoire. Specifically, the Rearrangements Schema includes repertoire_id for referencing the specific Repertoire for that Rearrangement.

If a Repertoire has multiple DataProcessing then data_processing_id should be used to distinguish the appropriate DataProcessing within the Repertoire. The Rearrangements Schema contains data_processing_id for this purpose. The data_processing_id is only unique within a Repertoire, so repertoire_id should first be used to get the appropriate Repertoire object and then data_processing_id used to acquire the appropriate DataProcessing.

It is expected that typical Repertoires might only have a single DataProcessing, in which case repertoire_id and data_processing_id will be semantically equivalent and only the former should be used.

If a Repertoire has multiple sample processing objects in the sample array then sample_processing_id should be used to distinguish the appropriate sample processing object within the Repertoire. The Rearrangement object can contain a sample_processing_id to uniquely identify a sample processing object within a Repertoire. Like data_processing_id, the sample_processing_id is only unique within the Repertoire, so repertoire_id should first be used to get the appropriate Repertoire object and then sample_processing_id should be used to determine the appropriate sample processing object that is associated with the Rearrangement. If the Rearrangement object does not have a sample_processing_id then it can be assumed that the rearrangement is associated with all of the samples in the Repertoire (e.g. the rearrangement is a collapsed rearrangement across multiple samples).

It is expected that Repertoires might often have a single sample processing object, in which case repertoire_id and sample_processing_id will be semantically equivalent and only the former should be used.

Finally, if it is necessary to link a Rearrangement object with a unique pairing of sample processing and DataProcessing, the repertoire_id of the Rearrangement object should be used to identify the correct Repertoire object and then the data_processing_id should be used to identify the correct DataProcessing metadata and the sample_processing_id should be used to identify the correct sample processing metadata within that Repertoire.
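The lookup order described in this section can be sketched as follows. This is an illustrative sketch only, assuming the Repertoire metadata has been loaded as a list of dictionaries; `resolve_links` is a hypothetical helper, not an AIRR library function. A missing sample_processing_id is treated as "all samples", as stated above; treating a missing data_processing_id the same way is an assumption of this sketch.

```python
def resolve_links(rearrangement, repertoires):
    """Return (repertoire, data_processing_list, sample_list) for a rearrangement."""
    # Step 1: repertoire_id is globally unique, so it selects the Repertoire.
    by_id = {rep["repertoire_id"]: rep for rep in repertoires}
    repertoire = by_id[rearrangement["repertoire_id"]]

    # Step 2: data_processing_id is only unique within that Repertoire.
    processing = repertoire["data_processing"]
    dp_id = rearrangement.get("data_processing_id")
    if dp_id is not None:
        processing = [dp for dp in processing if dp.get("data_processing_id") == dp_id]

    # Step 3: sample_processing_id is likewise scoped to the Repertoire.
    # Without it, the rearrangement is associated with all samples.
    samples = repertoire["sample"]
    sp_id = rearrangement.get("sample_processing_id")
    if sp_id is not None:
        samples = [s for s in samples if s.get("sample_processing_id") == sp_id]

    return repertoire, processing, samples


# Hypothetical metadata with two samples and two DataProcessing entries.
repertoires = [{
    "repertoire_id": "rep-1",
    "sample": [{"sample_processing_id": "sp-1"}, {"sample_processing_id": "sp-2"}],
    "data_processing": [{"data_processing_id": "dp-a"}, {"data_processing_id": "dp-b"}],
}]
rearrangement = {"repertoire_id": "rep-1", "data_processing_id": "dp-b"}
repertoire, processing, samples = resolve_links(rearrangement, repertoires)
```

Here the rearrangement resolves to the single DataProcessing `dp-b` and, because it carries no sample_processing_id, to both sample processing objects.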

Duality between Repertoires and Rearrangements#

There is an important duality relationship between Repertoires and Rearrangements, specifically with the experimental protocols described in the Repertoire versus the annotations on Rearrangements. A Repertoire defines an experimental design for what a researcher intends to measure or observe, while the Rearrangements are what was actually measured and observed. Technically, the border between the two occurs at sequencing, that is when the biological physical entity (prepared DNA) is measured and recorded as information (nucleotide sequence).

This duality is important when considering how to answer certain questions. For example, locus for Rearrangements may have the value “IGH”, which indicates that B cell heavy chain receptors were measured, yet the Repertoire might have “T cell” in cell_subset, which indicates the researcher intended to measure T cells. This conflict between the two indicates something is wrong. Such differences can arise in many ways, for example from errors in the experimental protocol, or from data processing that incorrectly handled the raw sequencing data and produced invalid annotations.
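A check for the conflict in this example could look like the sketch below. It is a rough heuristic under stated assumptions: loci starting with "IG" are taken to imply B cells and loci starting with "TR" to imply T cells, and the cell_subset value is compared as a plain text label (extracting that label from the Ontology value is left out). The function names are hypothetical.

```python
def expected_lineage(locus):
    """Map a locus name to the lineage it implies (assumed prefix convention)."""
    if locus.startswith("IG"):
        return "B cell"
    if locus.startswith("TR"):
        return "T cell"
    return None


def locus_matches_cell_subset(locus, cell_subset_label):
    """Return False only when the label names the opposite lineage."""
    lineage = expected_lineage(locus)
    if lineage is None or not cell_subset_label:
        return True  # nothing to compare, so no conflict can be shown
    opposite = "T cell" if lineage == "B cell" else "B cell"
    return opposite not in cell_subset_label


conflict = not locus_matches_cell_subset("IGH", "T cell")
```

The check deliberately reports a conflict only when the label explicitly names the other lineage, so broad labels are not flagged.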

File Format Specification#

Files are YAML/JSON with a structure defined below. Files should be encoded as UTF-8. Identifiers are case-sensitive. Files should have the extension .yaml, .yml, or .json.

File Structure#

The file as a whole is considered a dictionary (key/value pair) structure with the keys Info and Repertoire.
The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI V2 specification. If provided, version in Info should reference the version of the AIRR schema for the file.
The file should correspond to a list of Repertoire objects, using Repertoire as the key to the list.
Each Repertoire object should contain a top-level key/value pair for repertoire_id that uniquely identifies the repertoire.
Some fields require the use of a particular ontology or controlled vocabulary.
The structure is the same regardless of whether the data is stored in a file or a data repository. For example, the ADC API will return a properly structured JSON object that can be saved to a file and used directly without modification.
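The structure rules above can be illustrated with a small sketch that builds a document, round-trips it through JSON, and checks the top-level layout. The title, version, and identifier values are placeholders, the Repertoire object is abbreviated (its other fields are omitted), and `check_structure` is a hypothetical helper rather than part of any AIRR library.

```python
import json


def check_structure(doc):
    """Collect structural problems in a Repertoire metadata document."""
    problems = []
    info = doc.get("Info")
    if info is not None and "version" not in info:
        problems.append("Info is present but has no version")
    repertoires = doc.get("Repertoire")
    if not isinstance(repertoires, list):
        problems.append("Repertoire must be a list")
        return problems
    seen = set()
    for index, rep in enumerate(repertoires):
        rep_id = rep.get("repertoire_id")
        if rep_id is None:
            problems.append(f"Repertoire[{index}] has no repertoire_id")
        elif rep_id in seen:
            problems.append(f"duplicate repertoire_id: {rep_id}")
        seen.add(rep_id)
    return problems


# Placeholder values; a real file carries the full study, subject,
# sample, and data_processing content for each Repertoire.
document = {
    "Info": {"title": "Example repertoire metadata", "version": "1.4"},
    "Repertoire": [{"repertoire_id": "rep-1"}],
}
round_tripped = json.loads(json.dumps(document))
problems = check_structure(round_tripped)
```

Because the structure is identical in files and repositories, the same check could be applied to a JSON response saved from a repository query.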

Repertoire Fields#

Name	Type	Attributes	Definition
`repertoire_id`	string	optional, identifier, nullable	Identifier for the repertoire object. This identifier should be globally unique so that repertoires from multiple studies can be combined together without conflict. The repertoire_id is used to link other AIRR data to a Repertoire. Specifically, the Rearrangements Schema includes repertoire_id for referencing the specific Repertoire for that Rearrangement.
`repertoire_name`	string	optional, nullable	Short generic display name for the repertoire
`repertoire_description`	string	optional, nullable	Generic repertoire description
`study`	Study	required	Study object
`subject`	Subject	required	Subject object
`sample`	array of SampleProcessing	required	List of Sample Processing objects
`data_processing`	array of DataProcessing	required	List of Data Processing objects
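As a rough illustration of the Attributes column, the sketch below checks that the four fields marked required are present and have the expected container type. It assumes a Repertoire loaded as a Python dictionary; `missing_required` and the `REQUIRED` mapping are hypothetical names, and the check does not descend into the nested objects.

```python
# Required top-level fields and the Python type each should load as.
REQUIRED = {"study": dict, "subject": dict, "sample": list, "data_processing": list}


def missing_required(repertoire):
    """Report required Repertoire fields that are absent or wrongly typed."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in repertoire:
            problems.append(f"missing required field: {key}")
        elif not isinstance(repertoire[key], expected):
            problems.append(f"{key} should be a {expected.__name__}")
    return problems


report = missing_required(
    {"repertoire_id": "rep-1", "study": {}, "subject": {}, "sample": [], "data_processing": {}}
)
```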

Study Fields#

Name	Type	Attributes	Definition
`study_id`	string	required, identifier, nullable	Unique ID assigned by study registry such as one of the International Nucleotide Sequence Database Collaboration (INSDC) repositories.
`study_title`	string	required, nullable	Descriptive study title
`study_type`	Ontology	required, nullable	Type of study design
`study_description`	string	optional, nullable	Generic study description
`inclusion_exclusion_criteria`	string	required, nullable	List of criteria for inclusion/exclusion for the study
`grants`	string	required, nullable	Funding agencies and grant numbers
`study_contact`	string	optional, nullable	Full contact information of the contact persons for this study. This should include an e-mail address and a persistent identifier such as an ORCID ID.
`collected_by`	string	required, nullable	Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address and a persistent identifier such as an ORCID ID.
`lab_name`	string	required, nullable	Department of data collector
`lab_address`	string	required, nullable	Institution and institutional address of data collector
`submitted_by`	string	required, nullable	Full contact information of the data depositor, i.e., the person submitting the data to a repository. This should include an e-mail address and a persistent identifier such as an ORCID ID. This is supposed to be a short-lived and technical role until the submission is released.
`pub_ids`	string	required, nullable	Publications describing the rationale and/or outcome of the study. Wherever possible, a persistent identifier should be used, such as a DOI or a PubMed ID.
`keywords_study`	array of string	required, nullable	Keywords describing properties of one or more data sets in a study. “contains_schema” keywords indicate that the study contains data objects from the AIRR Schema of that type (Rearrangement, Clone, Cell, Receptor) while the other keywords indicate that the study design considers the type of data indicated (e.g. it is possible to have a study that “contains_paired_chain” but does not “contains_schema_cell”).
`adc_publish_date`	string	optional, nullable	Date the study was first published in the AIRR Data Commons.
`adc_update_date`	string	optional, nullable	Date the study data was updated in the AIRR Data Commons.

Subject Fields#

Name	Type	Attributes	Definition
`subject_id`	string	required, identifier, nullable	Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used.
`synthetic`	boolean	required	TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display)
`species`	Ontology	required	Binomial designation of subject’s species
`organism`	Ontology	DEPRECATED	Binomial designation of subject’s species
`sex`	string	required, nullable	Biological sex of subject
`age_min`	number	required, nullable	Specific age or lower boundary of age range.
`age_max`	number	required, nullable	Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null.
`age_unit`	Ontology	required, nullable	Unit of age range
`age_event`	string	required, nullable	Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations, submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time; hence the chosen events should be in temporal proximity.
`age`	string	DEPRECATED
`ancestry_population`	Ontology	required, nullable	Broad geographic origin of ancestry (continent)
`location_birth`	Ontology	optional, nullable	Self-reported location of birth of the subject, preferred granularity is country-level
`ethnicity`	string	required, nullable	Ethnic group of subject (defined as cultural/language-based membership)
`race`	string	required, nullable	Racial group of subject (as defined by NIH)
`strain_name`	string	required, nullable	Non-human designation of the strain or breed of animal used
`linked_subjects`	string	required, nullable	Subject ID to which Relation type refers
`link_type`	string	required, nullable	Relation between subject and linked_subjects, can be genetic or environmental (e.g. exposure)
`diagnosis`	array of Diagnosis	optional	Diagnosis information for subject
`genotype`	SubjectGenotype	optional, nullable
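The relationship between age_min and age_max in the table above can be checked with a short sketch. It assumes a Subject loaded as a dictionary; `check_age` is a hypothetical helper, and the ordering check (age_max not below age_min) is inferred from the "lower boundary" and "upper boundary" wording rather than stated as a rule.

```python
def check_age(subject):
    """Report inconsistencies between age_min and age_max."""
    age_min = subject.get("age_min")
    age_max = subject.get("age_max")
    problems = []
    # age_max should only be null if age_min is null.
    if age_max is None and age_min is not None:
        problems.append("age_max is null although age_min is set")
    # Inferred: an upper boundary should not lie below the lower boundary.
    if age_min is not None and age_max is not None and age_max < age_min:
        problems.append("age_max is smaller than age_min")
    return problems


specific_age = check_age({"age_min": 30, "age_max": 30})
open_range = check_age({"age_min": 30, "age_max": None})
```

A specific age is encoded with equal boundaries and passes, while a set age_min with a null age_max is reported.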

Diagnosis Fields#

Name	Type	Attributes	Definition
`study_group_description`	string	required, nullable	Designation of study arm to which the subject is assigned
`disease_diagnosis`	Ontology	required, nullable	Diagnosis of subject
`disease_length`	string	required, nullable	Time duration between initial diagnosis and current intervention
`disease_stage`	string	required, nullable	Stage of disease at current intervention
`prior_therapies`	string	required, nullable	List of all relevant previous therapies applied to subject for treatment of Diagnosis
`immunogen`	string	required, nullable	Antigen, vaccine or drug applied to subject at this intervention
`intervention`	string	required, nullable	Description of intervention
`medical_history`	string	required, nullable	Medical history of subject that is relevant to assess the course of disease and/or treatment

Sample Fields#

Name	Type	Attributes	Definition
`sample_id`	string	required, identifier, nullable	Sample ID assigned by submitter, unique within study. If possible, a persistent sample ID linked to INSDC or similar repository study should be used.
`sample_type`	string	required, nullable	The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture
`tissue`	Ontology	required, nullable	The actual tissue sampled, e.g. lymph node, liver, peripheral blood
`anatomic_site`	string	required, nullable	The anatomic location of the tissue, e.g. Inguinal, femur
`disease_state_sample`	string	required, nullable	Histopathologic evaluation of the sample
`collection_time_point_relative`	number	required, nullable	Time point at which sample was taken, relative to Collection time event
`collection_time_point_relative_unit`	Ontology	required, nullable	Unit of Sample collection time
`collection_time_point_reference`	string	required, nullable	Event in the study schedule to which Sample collection time relates
`collection_location`	Ontology	optional, nullable	Location where the sample was taken, preferred granularity is country-level
`biomaterial_provider`	string	required, nullable	Name and address of the entity providing the sample

Tissue and Cell Processing Fields#

Name	Type	Attributes	Definition
`tissue_processing`	string	required, nullable	Enzymatic digestion and/or physical methods used to isolate cells from sample
`cell_subset`	Ontology	required, nullable	Commonly-used designation of isolated cell population
`cell_phenotype`	string	required, nullable	List of cellular markers and their expression levels used to isolate the cell population
`cell_species`	Ontology	optional, nullable	Binomial designation of the species from which the analyzed cells originate. Typically, this value should be identical to species, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g., chimeric animal models. If set, this key will overwrite the species information for all lower layers of the schema.
`single_cell`	boolean	required, nullable	TRUE if single cells were isolated into separate compartments
`cell_number`	integer	required, nullable	Total number of cells that went into the experiment
`cells_per_reaction`	integer	required, nullable	Number of cells for each biological replicate
`cell_storage`	boolean	required, nullable	TRUE if cells were cryo-preserved between isolation and further processing
`cell_quality`	string	required, nullable	Relative amount of viable cells after preparation and (if applicable) thawing
`cell_isolation`	string	required, nullable	Description of the procedure used for marker-based isolation or enrichment of cells
`cell_processing_protocol`	string	required, nullable	Description of the methods applied to the sample including cell preparation/ isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript.

Nucleic Acid Processing Fields#

Name	Type	Attributes	Definition
`template_class`	string	required	The class of nucleic acid that was used as primary starting material for the following procedures
`template_quality`	string	required, nullable	Description and results of the quality control performed on the template material
`template_amount`	number	required, nullable	Amount of template that went into the process
`template_amount_unit`	Ontology	required, nullable	Unit of template amount
`library_generation_method`	string	required	Generic type of library generation
`library_generation_protocol`	string	required, nullable	Description of processes applied to substrate to obtain a library that is ready for sequencing
`library_generation_kit_version`	string	required, nullable	When using a library generation protocol from a commercial provider, provide the protocol version number
`pcr_target`	array of PCRTarget	optional	If a PCR step was performed that specifically targets the IG/TR loci, the target and primer locations need to be provided here. This field holds an array of PCRTarget objects, so that multiplex PCR setups amplifying multiple loci at the same time can be annotated using one record per locus. PCR setups not targeting any specific locus must not annotate this field but select the appropriate library_generation_method instead.
`complete_sequences`	string	required	To be considered complete, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered complete & untemplated, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. mixed should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value.
`physical_linkage`	string	required	In case an experimental setup is used that physically links nucleic acids derived from distinct Rearrangements before library preparation, this field describes the mode of that linkage. All hetero_* terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. _head-head refers to techniques that link the 5’ ends of transcripts in a single-cell context. _tail-head refers to techniques that link the 3’ end of one transcript to the 5’ end of another one in a single-cell context. This term does not provide any information whether a continuous reading-frame between the two is generated. *_prelinked refers to constructs in which the linkage was already present on the DNA level (e.g. scFv).

PCR Target Locus Fields#

Name	Type	Attributes	Definition
`pcr_target_locus`	string	required, nullable	Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature.
`forward_pcr_primer_target_location`	string	required, nullable	Position of the most distal nucleotide templated by the forward primer or primer mix
`reverse_pcr_primer_target_location`	string	required, nullable	Position of the most proximal nucleotide templated by the reverse primer or primer mix

Raw Sequence Data Fields#

Sequencing Run Fields#

Name	Type	Attributes	Definition
`sequencing_run_id`	string	required, identifier, nullable	ID of sequencing run assigned by the sequencing facility
`total_reads_passing_qc_filter`	integer	required, nullable	Number of usable reads for analysis
`sequencing_platform`	string	required, nullable	Designation of sequencing instrument used
`sequencing_facility`	string	required, nullable	Name and address of sequencing facility
`sequencing_run_date`	string	required, nullable	Date of sequencing run
`sequencing_kit`	string	required, nullable	Name, manufacturer, order and lot numbers of sequencing kit
`sequencing_files`	SequencingData	optional	Set of sequencing files produced by the sequencing run

Data Processing Fields#

Name	Type	Attributes	Definition
`data_processing_id`	string	optional, identifier, nullable	Identifier for the data processing object.
`primary_annotation`	boolean	optional, identifier	If true, indicates this is the primary or default data processing for the repertoire and its rearrangements. If false, indicates this is a secondary or additional data processing.
`software_versions`	string	required, nullable	Version number and / or date, include company pipelines
`paired_reads_assembly`	string	required, nullable	How paired end reads were assembled into a single receptor sequence
`quality_thresholds`	string	required, nullable	How/if sequences were removed from (4) based on base quality scores
`primer_match_cutoffs`	string	required, nullable	How primers were identified in the sequences, were they removed/masked/etc?
`collapsing_method`	string	required, nullable	The method used for combining multiple sequences from (4) into a single sequence in (5)
`data_processing_protocols`	string	required, nullable	General description of how QC is performed
`data_processing_files`	array of string	optional, nullable	Array of file names for data produced by this data processing.
`germline_database`	string	required, nullable	Source of germline V(D)J genes with version number or date accessed.
`germline_set_ref`	string	optional, nullable	Unique identifier of the germline set and version, in standardized form (Repo:Label:Version)
`analysis_provenance_id`	string	optional, nullable	Identifier for machine-readable PROV model of analysis provenance
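The primary_annotation flag can be used to pick a default DataProcessing when a Repertoire has several. The sketch below assumes a Repertoire loaded as a dictionary; `primary_data_processing` is a hypothetical helper, and treating a lone DataProcessing as the default regardless of its flag is an assumption of this sketch.

```python
def primary_data_processing(repertoire):
    """Return the default DataProcessing entry of a Repertoire."""
    entries = repertoire["data_processing"]
    # Assumption: a single entry is the default even without the flag.
    if len(entries) == 1:
        return entries[0]
    primaries = [dp for dp in entries if dp.get("primary_annotation")]
    if len(primaries) != 1:
        raise ValueError(
            f"expected exactly one primary DataProcessing, found {len(primaries)}"
        )
    return primaries[0]


# Hypothetical Repertoire with a primary and a secondary DataProcessing.
repertoire = {
    "repertoire_id": "rep-1",
    "data_processing": [
        {"data_processing_id": "dp-a", "primary_annotation": False},
        {"data_processing_id": "dp-b", "primary_annotation": True},
    ],
}
chosen = primary_data_processing(repertoire)
```

Raising an error when zero or several entries are flagged keeps an ambiguous default from being chosen silently.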
