Rearrangement Schema

A Rearrangement is a sequence that describes a rearranged adaptive immune receptor chain (e.g., antibody heavy chain or TCR beta chain) along with a host of annotations. These annotations are defined by the AIRR Rearrangement schema and comprise eight categories.

Category | Description
Input | The input sequence to the V(D)J assignment process.
Identifiers | Primary and foreign key identifiers for linking AIRR data across files and databases.
Primary Annotations | The primary outputs of the V(D)J assignment process, which include the gene locus, V, D, J, and C gene calls, various flags, V(D)J junction sequence, copy number (duplicate_count), and the number of reads contributing to a consensus input sequence (consensus_count).
Alignment Annotations | Detailed alignment annotations including the input and germline sequences used in the alignment; score, identity, statistical support (E-value, likelihood, etc.); and the alignment itself through CIGAR strings for each aligned gene.
Alignment Positions | The start/end positions for genes in both the input and germline sequences.
Region Sequence | Sequence annotations for the framework regions (FWRs) and complementarity-determining regions (CDRs).
Region Positions | Positional annotations for the framework regions (FWRs) and complementarity-determining regions (CDRs).
Junction Lengths | Lengths for junction sub-regions associated with aspects of the V(D)J recombination process.

File Format Specification

The format specification describes the file format and how to structure the data.

Definition Clarifications

Junction versus CDR3

We work with the IMGT definitions of the junction and CDR3 regions. Specifically, the IMGT JUNCTION includes the conserved cysteine and tryptophan/phenylalanine residues, while CDR3 excludes those two residues. Therefore, our junction and junction_aa fields, which represent the extracted sequence, include the two conserved residues, while the coordinate fields (cdr3_start and cdr3_end) exclude them.
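
The relationship between the two definitions can be sketched in Python. This is a minimal illustration, assuming the junction is in-frame so that each conserved residue occupies exactly one codon; the function names `junction_to_cdr3` and `junction_aa_to_cdr3_aa` are hypothetical and not part of any AIRR reference library.

```python
def junction_to_cdr3(junction: str) -> str:
    """Drop the two conserved flanking codons from a nucleotide junction."""
    if len(junction) < 6:
        raise ValueError("junction must contain at least the two conserved codons")
    # First codon: conserved cysteine; last codon: conserved W/F.
    return junction[3:-3]


def junction_aa_to_cdr3_aa(junction_aa: str) -> str:
    """Drop the two conserved flanking residues from an amino acid junction."""
    if len(junction_aa) < 2:
        raise ValueError("junction_aa must contain at least the two conserved residues")
    return junction_aa[1:-1]
```

For example, under these assumptions a junction_aa of CARDYW corresponds to a CDR3 of ARDY.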

Productive

The schema does not impose a strict definition of a productive rearrangement. However, the IMGT definition is recommended:

  1. Coding region has an open reading frame.
  2. No defect in the start codon, splicing sites or regulatory elements.
  3. No internal stop codons.
  4. An in-frame junction region.
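
Two of the criteria above (no internal stop codons and an in-frame junction) can be approximated from sequence data alone. The following is a rough sketch, not the schema's or IMGT's implementation. It assumes `coding_seq` is an ungapped nucleotide sequence that already starts at the first base of a codon, and it does not attempt to check start codons, splicing sites, or regulatory elements. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(coding_seq: str) -> bool:
    """True if any complete codon in reading frame 1 is a stop codon."""
    seq = coding_seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return any(codon in STOP_CODONS for codon in codons)


def junction_in_frame(junction: str) -> bool:
    """True if the junction is non-empty and its length is a multiple of three."""
    return len(junction) > 0 and len(junction) % 3 == 0


def looks_productive(coding_seq: str, junction: str) -> bool:
    """Heuristic covering only the stop codon and junction frame criteria."""
    return junction_in_frame(junction) and not has_stop_codon(coding_seq)
```

A tool that reports the productive field may apply additional checks, so a result from this sketch should be treated as a lower bar than the full IMGT definition.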

Locus names

A naming convention for locus names is not strictly enforced, but the IMGT locus names are recommended. For example, in the case of human data, this would be the set: IGH, IGK, IGL, TRA, TRB, TRD, or TRG.

Gene and allele names

Gene call examples use the IMGT nomenclature, but no specific gene or allele nomenclature is mandated. Species denotations may or may not be included in the gene name, as appropriate. For example, “Homo sapiens IGHV4-59*01”, “IGHV4-59*01” and “AB019438” are all valid entries for the same allele.

Alignments

There is no required alignment scheme for the nucleotide and amino acid alignment fields. These fields may, or may not, include numbering spacers (e.g., IMGT-numbering gaps), variations in case to denote mismatches, deletions, or other features appropriate to the tool that performed the alignment. The only strict requirement is that the query (“sequence”) and reference (“germline”) must be properly aligned.
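
One way to check this requirement is sketched below. It interprets "properly aligned" as "both strings have the same number of columns", which is an assumption rather than wording from the schema. The gap characters `.` and `-` and the case-insensitive comparison are likewise illustrative choices, and the function names are hypothetical.

```python
GAP_CHARS = {".", "-"}


def is_column_aligned(sequence_alignment: str, germline_alignment: str) -> bool:
    """True if the query and germline alignments have the same number of columns."""
    return len(sequence_alignment) == len(germline_alignment)


def count_mismatches(sequence_alignment: str, germline_alignment: str) -> int:
    """Count columns where query and germline differ.

    Columns containing a gap character or an N in either string are skipped.
    Case is ignored because some tools use lowercase to flag features.
    """
    if not is_column_aligned(sequence_alignment, germline_alignment):
        raise ValueError("alignments differ in length and cannot be compared")
    mismatches = 0
    pairs = zip(sequence_alignment.upper(), germline_alignment.upper())
    for query_base, germline_base in pairs:
        if query_base in GAP_CHARS or germline_base in GAP_CHARS:
            continue
        if "N" in (query_base, germline_base):
            continue
        if query_base != germline_base:
            mismatches += 1
    return mismatches
```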

Fields

The specification includes two classes of fields: those that are required and those that are optional. Required is defined as a column that must be present in the header of the TSV. Optional is defined as a column that may, or may not, appear in the TSV. All fields, including required fields, are nullable by assigning an empty string as the value. There are no requirements for column ordering in the schema, although the Python and R reference APIs enforce ordering for the sake of generating predictable output. The set of optional fields that provide alignment and region coordinates (“_start” and “_end” fields) are defined as 1-based closed intervals, similar to the SAM, VCF, GFF, IMGT, and INSDC formats (GenBank, ENA, and DDBJ; http://www.insdc.org).
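
These rules can be illustrated with a short reader built on the Python standard library. This is a sketch under stated assumptions, not the reference API: `REQUIRED_FIELDS` is simply the set of fields marked required in the table in this section, and the function names are hypothetical.

```python
import csv
import io

# Fields marked "required" in the field table of this section.
REQUIRED_FIELDS = [
    "sequence_id", "sequence", "rev_comp", "productive",
    "v_call", "d_call", "j_call",
    "sequence_alignment", "germline_alignment",
    "junction", "junction_aa",
    "v_cigar", "d_cigar", "j_cigar",
]


def read_rearrangements(handle):
    """Read TSV rows, checking the header and mapping empty strings to None."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_FIELDS if name not in header]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    # Required columns must exist in the header, but any value may be null.
    return [
        {name: (value if value != "" else None) for name, value in row.items()}
        for row in reader
    ]


def read_rearrangements_from_text(text):
    """Convenience wrapper for in-memory TSV content."""
    return read_rearrangements(io.StringIO(text))


def extract_region(sequence, start, end):
    """Slice a region given 1-based closed coordinates (both ends inclusive)."""
    # Python slices are 0-based and half-open, so only the start shifts by one.
    return sequence[start - 1:end]
```

Coordinate values arrive as strings when read this way, so they need converting with `int()` before being passed to `extract_region`.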

Most fields have strict definitions for the values that they contain. However, some commonly provided information cannot be standardized across diverse toolchains, so a small selection of fields have context-dependent definitions. In particular, these context-dependent fields include the optional “_score,” “_identity,” and “_support” fields used for assessing the quality of alignments, which vary considerably in definition depending on the methodology used. Similarly, the “_alignment” fields require strict alignment between the corresponding observed and germline sequences, but the manner in which that alignment is conveyed is somewhat flexible in that it allows for any numbering scheme (e.g., IMGT or Kabat) or lack thereof.

By default, data elements representing sequences in the schema contain nucleotide sequences except for data elements ending in “_aa,” which are amino acid translations of the associated nucleotide sequence.
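
As an illustration of how an “_aa” value relates to its nucleotide counterpart, the sketch below translates a sequence using the standard genetic code. The handling of edge cases is an assumption made for this example, not something the schema prescribes: codons containing gaps or ambiguous bases become X, and a trailing partial codon is dropped. The function name `translate` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate reading frame 1 of a nucleotide sequence into amino acids."""
    seq = nucleotides.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # unknown or gapped codon -> X
        for i in range(0, len(seq) - 2, 3)
    )
```

Fields such as sequence_alignment may contain numbering spacers, so a tool's own translation of those fields can differ from this simple frame 1 approach.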

While the format contains an extensive list of reserved field names, there are no restrictions on inclusion of custom fields in the TSV file, provided such custom fields have a unique name. Furthermore, suggestions for extending the format with additional reserved names are welcomed through the issue tracker on the GitHub repository (https://github.com/airr-community/airr-standards).
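
A header check for the uniqueness requirement might look like the following sketch. The helper names are hypothetical, and the `reserved` argument is whatever set of reserved field names the caller supplies; no such list is hard-coded here.

```python
from collections import Counter


def duplicate_columns(header):
    """Return the column names that appear more than once in a TSV header."""
    counts = Counter(header)
    return sorted(name for name, count in counts.items() if count > 1)


def custom_columns(header, reserved):
    """Return header columns that are not in the caller's reserved name set."""
    return [name for name in header if name not in reserved]
```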


Name Type Priority Description
sequence_id string required Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment.
sequence string required The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment.
sequence_aa string optional Amino acid translation of the query nucleotide sequence.
rev_comp boolean required True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of ‘sequence’.
productive boolean required True if the V(D)J sequence is predicted to be productive.
vj_in_frame boolean optional True if the V and J segment alignments are in-frame.
stop_codon boolean optional True if the aligned sequence contains a stop codon.
locus string optional Gene locus (chain type). For example, IGH, IGI, IGK, IGL, TRA, TRB, TRD, or TRG.
v_call string required V gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHV4-59*01).
d_call string required D gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01).
j_call string required J gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHJ4*02).
c_call string optional C region gene with allele. If referring to a known reference sequence in a database, such as IMGT/GENE-DB, the relevant gene/allele nomenclature should be followed (e.g., IGHM*01).
sequence_alignment string required Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement.
sequence_alignment_aa string optional Amino acid translation of the aligned query sequence.
germline_alignment string required Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any).
germline_alignment_aa string optional Amino acid translation of the assembled germline sequence.
junction string required Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons.
junction_aa string required Junction region amino acid sequence.
np1 string optional Nucleotide sequence of the combined N/P region between the V and D segments or V and J segments.
np1_aa string optional Amino acid translation of the np1 field.
np2 string optional Nucleotide sequence of the combined N/P region between the D and J segments.
np2_aa string optional Amino acid translation of the np2 field.
cdr1 string optional Nucleotide sequence of the aligned CDR1 region.
cdr1_aa string optional Amino acid translation of the cdr1 field.
cdr2 string optional Nucleotide sequence of the aligned CDR2 region.
cdr2_aa string optional Amino acid translation of the cdr2 field.
cdr3 string optional Nucleotide sequence of the aligned CDR3 region.
cdr3_aa string optional Amino acid translation of the cdr3 field.
fwr1 string optional Nucleotide sequence of the aligned FWR1 region.
fwr1_aa string optional Amino acid translation of the fwr1 field.
fwr2 string optional Nucleotide sequence of the aligned FWR2 region.
fwr2_aa string optional Amino acid translation of the fwr2 field.
fwr3 string optional Nucleotide sequence of the aligned FWR3 region.
fwr3_aa string optional Amino acid translation of the fwr3 field.
fwr4 string optional Nucleotide sequence of the aligned FWR4 region.
fwr4_aa string optional Amino acid translation of the fwr4 field.
v_score number optional Alignment score for the V gene alignment.
v_identity number optional Fractional identity for the V gene alignment.
v_support number optional V gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the V gene assignment as defined by the alignment tool.
v_cigar string required CIGAR string for the V gene alignment.
d_score number optional Alignment score for the D gene alignment.
d_identity number optional Fractional identity for the D gene alignment.
d_support number optional D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the D gene assignment as defined by the alignment tool.
d_cigar string required CIGAR string for the D gene alignment.
j_score number optional Alignment score for the J gene alignment.
j_identity number optional Fractional identity for the J gene alignment.
j_support number optional J gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the J gene assignment as defined by the alignment tool.
j_cigar string required CIGAR string for the J gene alignment.
c_score number optional Alignment score for the C gene alignment.
c_identity number optional Fractional identity for the C gene alignment.
c_support number optional C gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the C gene assignment as defined by the alignment tool.
c_cigar string optional CIGAR string for the C gene alignment.
v_sequence_start integer optional Start position of the V segment in the query sequence (1-based closed interval).
v_sequence_end integer optional End position of the V segment in the query sequence (1-based closed interval).
v_germline_start integer optional Alignment start position in the V gene reference sequence (1-based closed interval).
v_germline_end integer optional Alignment end position in the V gene reference sequence (1-based closed interval).
v_alignment_start integer optional Start position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
v_alignment_end integer optional End position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
d_sequence_start integer optional Start position of the D segment in the query sequence (1-based closed interval).
d_sequence_end integer optional End position of the D segment in the query sequence (1-based closed interval).
d_germline_start integer optional Alignment start position in the D gene reference sequence (1-based closed interval).
d_germline_end integer optional Alignment end position in the D gene reference sequence (1-based closed interval).
d_alignment_start integer optional Start position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
d_alignment_end integer optional End position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
j_sequence_start integer optional Start position of the J segment in the query sequence (1-based closed interval).
j_sequence_end integer optional End position of the J segment in the query sequence (1-based closed interval).
j_germline_start integer optional Alignment start position in the J gene reference sequence (1-based closed interval).
j_germline_end integer optional Alignment end position in the J gene reference sequence (1-based closed interval).
j_alignment_start integer optional Start position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
j_alignment_end integer optional End position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
cdr1_start integer optional CDR1 start position in the query sequence (1-based closed interval).
cdr1_end integer optional CDR1 end position in the query sequence (1-based closed interval).
cdr2_start integer optional CDR2 start position in the query sequence (1-based closed interval).
cdr2_end integer optional CDR2 end position in the query sequence (1-based closed interval).
cdr3_start integer optional CDR3 start position in the query sequence (1-based closed interval).
cdr3_end integer optional CDR3 end position in the query sequence (1-based closed interval).
fwr1_start integer optional FWR1 start position in the query sequence (1-based closed interval).
fwr1_end integer optional FWR1 end position in the query sequence (1-based closed interval).
fwr2_start integer optional FWR2 start position in the query sequence (1-based closed interval).
fwr2_end integer optional FWR2 end position in the query sequence (1-based closed interval).
fwr3_start integer optional FWR3 start position in the query sequence (1-based closed interval).
fwr3_end integer optional FWR3 end position in the query sequence (1-based closed interval).
fwr4_start integer optional FWR4 start position in the query sequence (1-based closed interval).
fwr4_end integer optional FWR4 end position in the query sequence (1-based closed interval).
v_sequence_alignment string optional Aligned portion of query sequence assigned to the V segment, including any indel corrections or numbering spacers.
v_sequence_alignment_aa string optional Amino acid translation of the v_sequence_alignment field.
d_sequence_alignment string optional Aligned portion of query sequence assigned to the D segment, including any indel corrections or numbering spacers.
d_sequence_alignment_aa string optional Amino acid translation of the d_sequence_alignment field.
j_sequence_alignment string optional Aligned portion of query sequence assigned to the J segment, including any indel corrections or numbering spacers.
j_sequence_alignment_aa string optional Amino acid translation of the j_sequence_alignment field.
c_sequence_alignment string optional Aligned portion of query sequence assigned to the constant region, including any indel corrections or numbering spacers.
c_sequence_alignment_aa string optional Amino acid translation of the c_sequence_alignment field.
v_germline_alignment string optional Aligned V gene germline sequence spanning the same region as the v_sequence_alignment field and including the same set of corrections and spacers (if any).
v_germline_alignment_aa string optional Amino acid translation of the v_germline_alignment field.
d_germline_alignment string optional Aligned D gene germline sequence spanning the same region as the d_sequence_alignment field and including the same set of corrections and spacers (if any).
d_germline_alignment_aa string optional Amino acid translation of the d_germline_alignment field.
j_germline_alignment string optional Aligned J gene germline sequence spanning the same region as the j_sequence_alignment field and including the same set of corrections and spacers (if any).
j_germline_alignment_aa string optional Amino acid translation of the j_germline_alignment field.
c_germline_alignment string optional Aligned constant region germline sequence spanning the same region as the c_sequence_alignment field and including the same set of corrections and spacers (if any).
c_germline_alignment_aa string optional Amino acid translation of the c_germline_alignment field.
junction_length integer optional Number of nucleotides in the junction sequence.
junction_aa_length integer optional Number of amino acids in the junction sequence.
np1_length integer optional Number of nucleotides between the V and D segments or V and J segments.
np2_length integer optional Number of nucleotides between the D and J segments.
n1_length integer optional Number of untemplated nucleotides 5’ of the D segment.
n2_length integer optional Number of untemplated nucleotides 3’ of the D segment.
p3v_length integer optional Number of palindromic nucleotides 3’ of the V segment.
p5d_length integer optional Number of palindromic nucleotides 5’ of the D segment.
p3d_length integer optional Number of palindromic nucleotides 3’ of the D segment.
p5j_length integer optional Number of palindromic nucleotides 5’ of the J segment.
consensus_count integer optional Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence.
duplicate_count integer optional Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs.
pair_id string optional Valid sequence_id that was determined by experimental or computational means to be associated with the current Rearrangement on the cellular level.
cell_id string optional Identifier defining the cell of origin for the query sequence.
clone_id string optional Clonal cluster assignment for the query sequence.
rearrangement_id string optional Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications.
repertoire_id string optional Identifier to the associated repertoire in study metadata.
data_processing_id string optional Identifier to the data processing object in the repertoire metadata for this rearrangement. If this field is empty, then the primary data processing object is assumed.
germline_database string optional Source of germline V(D)J genes with version number or date accessed.
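
The “_cigar” fields listed above can be decoded with a small parser. The sketch below assumes the operation semantics of the SAM specification (M, =, and X consume both query and reference; I and S consume only the query; D and N consume only the reference). Whether a given tool emits every one of these operations is not stated in this section, and the function names are hypothetical.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")      # operations that consume query bases
REFERENCE_OPS = set("MDN=X")  # operations that consume reference bases


def parse_cigar(cigar: str):
    """Split a CIGAR string into (length, operation) tuples."""
    tokens = CIGAR_TOKEN.findall(cigar)
    # Rebuilding the string catches stray characters the regex skipped over.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError("malformed CIGAR string: " + repr(cigar))
    return [(int(length), op) for length, op in tokens]


def cigar_spans(cigar: str):
    """Return (query_length, reference_length) covered by a CIGAR string."""
    operations = parse_cigar(cigar)
    query_length = sum(n for n, op in operations if op in QUERY_OPS)
    reference_length = sum(n for n, op in operations if op in REFERENCE_OPS)
    return query_length, reference_length
```

Under these assumptions, the reference span from `cigar_spans` could be compared against the difference between the matching “_germline_start” and “_germline_end” values as a consistency check.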