Rearrangement Schema

See the format overview for details on how to structure this data.

Definition Clarifications

Junction versus CDR3

We use the IMGT definitions of the junction and CDR3 regions. Specifically, the IMGT JUNCTION includes the conserved cysteine and tryptophan/phenylalanine residues, while the CDR3 excludes those two residues. Therefore, the junction and junction_aa fields, which represent the extracted sequence, include the two conserved residues, while the coordinate fields (cdr3_start and cdr3_end) exclude them.
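The relationship between the two definitions can be sketched in a few lines of Python. This is only an illustration, assuming an ungapped, in-frame junction; the function names are hypothetical and not part of the schema.

```python
def cdr3_from_junction(junction: str) -> str:
    """Trim one codon from each end of the junction nucleotide sequence.

    Assumes the junction is ungapped and starts with the conserved
    cysteine codon and ends with the conserved W/F codon.
    """
    if len(junction) < 6:
        raise ValueError("junction is too short to hold both conserved codons")
    return junction[3:-3]


def cdr3_aa_from_junction_aa(junction_aa: str) -> str:
    """Trim the two conserved residues from the junction amino acid sequence."""
    if len(junction_aa) < 2:
        raise ValueError("junction_aa is too short to hold both conserved residues")
    return junction_aa[1:-1]


# Example: TGT GCG AGA GAC TAC TGG translates to C A R D Y W.
print(cdr3_from_junction("TGTGCGAGAGACTACTGG"))  # GCGAGAGACTAC
print(cdr3_aa_from_junction_aa("CARDYW"))        # ARDY
```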

Productive

The schema does not impose a strict definition of a productive rearrangement. However, the IMGT definition is recommended:

  1. Coding region has an open reading frame.
  2. No defect in the start codon, splicing sites or regulatory elements.
  3. No internal stop codons.
  4. An in-frame junction region.
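As a rough illustration, criteria 3 and 4 can be checked directly from sequence data. The sketch below is not a complete productivity test: it assumes the coding sequence is ungapped and already in frame at its first nucleotide, it does not evaluate criterion 2 (start codon, splice sites, regulatory elements), and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_internal_stop(coding_seq: str) -> bool:
    """True if any complete codon, read in frame from position 0, is a stop."""
    seq = coding_seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return any(codon in STOP_CODONS for codon in codons)


def junction_in_frame(junction: str) -> bool:
    """True if the junction length is a non-zero multiple of three."""
    return len(junction) > 0 and len(junction) % 3 == 0


def passes_sequence_checks(coding_seq: str, junction: str) -> bool:
    """Partial check covering only criteria 3 and 4 of the list above."""
    return junction_in_frame(junction) and not has_internal_stop(coding_seq)
```

A tool would normally combine a check like this with its own alignment information before setting the productive field.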

Locus names

A naming convention for locus names is not strictly enforced, but the IMGT locus names are recommended. For example, in the case of human data, this would be the set: IGH, IGK, IGL, TRA, TRB, TRD, or TRG.

Gene and allele names

Gene call examples use the IMGT nomenclature, but no specific gene or allele nomenclature is mandated. Species denotations may or may not be included in the gene name, as appropriate. For example, “Homosap IGHV4-59*01”, “IGHV4-59*01” and “AB019438” are all valid entries for the same allele.
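Because several spellings are valid, code that consumes gene calls should not assume a single format. The following sketch splits an IMGT-style call into an optional species prefix, gene, and allele, and returns None for anything else (such as an accession number). The regular expression is an assumption chosen for illustration and does not cover every naming scheme.

```python
import re

# Hypothetical pattern: optional species token, a gene name starting with
# IG or TR, and an optional numeric allele after an asterisk.
GENE_CALL = re.compile(
    r"(?:(?P<species>\S+)\s+)?"
    r"(?P<gene>(?:IG|TR)[A-Z][A-Z0-9/-]*)"
    r"(?:\*(?P<allele>\d+))?"
)


def parse_gene_call(call: str):
    """Return a dict with species, gene and allele, or None if not IMGT-style."""
    match = GENE_CALL.fullmatch(call.strip())
    return match.groupdict() if match else None


print(parse_gene_call("Homosap IGHV4-59*01"))
print(parse_gene_call("AB019438"))  # None: valid entry, but not IMGT-style
```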

Alignments

There is no required alignment scheme for the nucleotide and amino acid alignment fields. These fields may, or may not, include numbering spacers (e.g., IMGT-numbering gaps), variations in case to denote mismatches, deletions, or other features appropriate to the tool that performed the alignment. The only strict requirement is that the query (“sequence”) and reference (“germline”) must be properly aligned.
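One practical consequence of the alignment requirement is that the two strings can be compared position by position. The sketch below counts mismatches under some assumptions that are not mandated by the schema: "." and "-" are treated as gap or spacer characters, N is treated as ambiguous, and case is ignored because some tools use it to mark features.

```python
GAP_CHARS = {".", "-"}  # assumed spacer/gap characters


def count_mismatches(sequence_alignment: str, germline_alignment: str) -> int:
    """Count differing positions between two equal-length aligned strings."""
    if len(sequence_alignment) != len(germline_alignment):
        raise ValueError("query and germline alignments differ in length")
    mismatches = 0
    pairs = zip(sequence_alignment.upper(), germline_alignment.upper())
    for query_base, germline_base in pairs:
        # Skip positions where either side is a gap, spacer, or ambiguous base.
        if query_base in GAP_CHARS or germline_base in GAP_CHARS:
            continue
        if query_base == "N" or germline_base == "N":
            continue
        if query_base != germline_base:
            mismatches += 1
    return mismatches
```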

Fields

Name Type Priority Description
sequence_id string required Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment.
sequence string required The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment.
sequence_aa string optional Amino acid translation of the query nucleotide sequence.
rev_comp boolean required True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of ‘sequence’.
productive boolean required True if the V(D)J sequence is predicted to be productive.
vj_in_frame boolean optional True if the V and J segment alignments are in-frame.
stop_codon boolean optional True if the aligned sequence contains a stop codon.
locus string optional Gene locus (chain type). For example, IGH, IGK, IGL, TRA, TRB, TRD, or TRG.
v_call string required V segment gene with allele. For example, IGHV4-59*01.
d_call string required D segment gene with allele. For example, IGHD3-10*01.
j_call string required J segment gene with allele. For example, IGHJ4*02.
c_call string optional C region gene with allele. For example, IGHM*01.
sequence_alignment string required Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement.
sequence_alignment_aa string optional Amino acid translation of the aligned query sequence.
germline_alignment string required Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any).
germline_alignment_aa string optional Amino acid translation of the assembled germline sequence.
junction string required Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons.
junction_aa string required Junction region amino acid sequence.
np1 string optional Nucleotide sequence of the combined N/P region between the V and D segments or V and J segments.
np1_aa string optional Amino acid translation of the np1 field.
np2 string optional Nucleotide sequence of the combined N/P region between the D and J segments.
np2_aa string optional Amino acid translation of the np2 field.
cdr1 string optional Nucleotide sequence of the aligned CDR1 region.
cdr1_aa string optional Amino acid translation of the cdr1 field.
cdr2 string optional Nucleotide sequence of the aligned CDR2 region.
cdr2_aa string optional Amino acid translation of the cdr2 field.
cdr3 string optional Nucleotide sequence of the aligned CDR3 region.
cdr3_aa string optional Amino acid translation of the cdr3 field.
fwr1 string optional Nucleotide sequence of the aligned FWR1 region.
fwr1_aa string optional Amino acid translation of the fwr1 field.
fwr2 string optional Nucleotide sequence of the aligned FWR2 region.
fwr2_aa string optional Amino acid translation of the fwr2 field.
fwr3 string optional Nucleotide sequence of the aligned FWR3 region.
fwr3_aa string optional Amino acid translation of the fwr3 field.
fwr4 string optional Nucleotide sequence of the aligned FWR4 region.
fwr4_aa string optional Amino acid translation of the fwr4 field.
v_score number optional V segment alignment score.
v_identity number optional V segment alignment fractional identity.
v_support number optional V segment alignment E-value, p-value, likelihood, probability or other similar measure of support for the V segment assignment as defined by the alignment tool.
v_cigar string required V segment alignment CIGAR string.
d_score number optional D segment alignment score.
d_identity number optional D segment alignment fractional identity.
d_support number optional D segment alignment E-value, p-value, likelihood, probability or other similar measure of support for the D segment assignment as defined by the alignment tool.
d_cigar string required D segment alignment CIGAR string.
j_score number optional J segment alignment score.
j_identity number optional J segment alignment fractional identity.
j_support number optional J segment alignment E-value, p-value, likelihood, probability or other similar measure of support for the J segment assignment as defined by the alignment tool.
j_cigar string required J segment alignment CIGAR string.
c_score number optional C region alignment score.
c_identity number optional C region alignment fractional identity.
c_support number optional C region alignment E-value, p-value, likelihood, probability or other similar measure of support for the C region assignment as defined by the alignment tool.
c_cigar string optional C region alignment CIGAR string.
v_sequence_start integer optional Start position of the V segment in the query sequence (1-based closed interval).
v_sequence_end integer optional End position of the V segment in the query sequence (1-based closed interval).
v_germline_start integer optional Alignment start position in the V reference sequence (1-based closed interval).
v_germline_end integer optional Alignment end position in the V reference sequence (1-based closed interval).
v_alignment_start integer optional Start position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
v_alignment_end integer optional End position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
d_sequence_start integer optional Start position of the D segment in the query sequence (1-based closed interval).
d_sequence_end integer optional End position of the D segment in the query sequence (1-based closed interval).
d_germline_start integer optional Alignment start position in the D reference sequence (1-based closed interval).
d_germline_end integer optional Alignment end position in the D reference sequence (1-based closed interval).
d_alignment_start integer optional Start position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
d_alignment_end integer optional End position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
j_sequence_start integer optional Start position of the J segment in the query sequence (1-based closed interval).
j_sequence_end integer optional End position of the J segment in the query sequence (1-based closed interval).
j_germline_start integer optional Alignment start position in the J reference sequence (1-based closed interval).
j_germline_end integer optional Alignment end position in the J reference sequence (1-based closed interval).
j_alignment_start integer optional Start position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
j_alignment_end integer optional End position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
cdr1_start integer optional CDR1 start position in the query sequence (1-based closed interval).
cdr1_end integer optional CDR1 end position in the query sequence (1-based closed interval).
cdr2_start integer optional CDR2 start position in the query sequence (1-based closed interval).
cdr2_end integer optional CDR2 end position in the query sequence (1-based closed interval).
cdr3_start integer optional CDR3 start position in the query sequence (1-based closed interval).
cdr3_end integer optional CDR3 end position in the query sequence (1-based closed interval).
fwr1_start integer optional FWR1 start position in the query sequence (1-based closed interval).
fwr1_end integer optional FWR1 end position in the query sequence (1-based closed interval).
fwr2_start integer optional FWR2 start position in the query sequence (1-based closed interval).
fwr2_end integer optional FWR2 end position in the query sequence (1-based closed interval).
fwr3_start integer optional FWR3 start position in the query sequence (1-based closed interval).
fwr3_end integer optional FWR3 end position in the query sequence (1-based closed interval).
fwr4_start integer optional FWR4 start position in the query sequence (1-based closed interval).
fwr4_end integer optional FWR4 end position in the query sequence (1-based closed interval).
v_sequence_alignment string optional V segment aligned portion of query sequence, including any indel corrections or numbering spacers.
v_sequence_alignment_aa string optional Amino acid translation of the V segment aligned portion of the query sequence.
d_sequence_alignment string optional D segment aligned portion of query sequence, including any indel corrections or numbering spacers.
d_sequence_alignment_aa string optional Amino acid translation of the D segment aligned portion of the query sequence.
j_sequence_alignment string optional J segment aligned portion of query sequence, including any indel corrections or numbering spacers.
j_sequence_alignment_aa string optional Amino acid translation of the J segment aligned portion of the query sequence.
c_sequence_alignment string optional Constant region aligned portion of query sequence, including any indel corrections or numbering spacers.
c_sequence_alignment_aa string optional Amino acid translation of the constant region aligned portion of the query sequence.
v_germline_alignment string optional Aligned V segment germline sequence spanning the same region as the v_sequence_alignment field and including the same set of corrections and spacers (if any).
v_germline_alignment_aa string optional Amino acid translation of the aligned V segment germline sequence.
d_germline_alignment string optional Aligned D segment germline sequence spanning the same region as the d_sequence_alignment field and including the same set of corrections and spacers (if any).
d_germline_alignment_aa string optional Amino acid translation of the aligned D segment germline sequence.
j_germline_alignment string optional Aligned J segment germline sequence spanning the same region as the j_sequence_alignment field and including the same set of corrections and spacers (if any).
j_germline_alignment_aa string optional Amino acid translation of the aligned J segment germline sequence.
c_germline_alignment string optional Aligned constant region germline sequence spanning the same region as the c_sequence_alignment field and including the same set of corrections and spacers (if any).
c_germline_alignment_aa string optional Amino acid translation of the aligned constant region germline sequence.
junction_length integer optional Number of nucleotides in the junction sequence.
np1_length integer optional Number of nucleotides between the V and D segments or V and J segments.
np2_length integer optional Number of nucleotides between the D and J segments.
n1_length integer optional Number of untemplated nucleotides 5’ of the D segment.
n2_length integer optional Number of untemplated nucleotides 3’ of the D segment.
p3v_length integer optional Number of palindromic nucleotides 3’ of the V segment.
p5d_length integer optional Number of palindromic nucleotides 5’ of the D segment.
p3d_length integer optional Number of palindromic nucleotides 3’ of the D segment.
p5j_length integer optional Number of palindromic nucleotides 5’ of the J segment.
consensus_count integer optional Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence.
duplicate_count integer optional Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs.
cell_id string optional Identifier defining the cell of origin for the query sequence.
clone_id string optional Clonal cluster assignment for the query sequence.
rearrangement_id string optional Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications.
rearrangement_set_id string optional Identifier for grouping Rearrangement objects.
germline_database string optional Source of germline V(D)J segments, with version number or date accessed. For example, ‘IMGT/GENE-DB 3.1.18 (15 March 2018)’.
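
To show how the field table might be used in practice, the following sketch reads a tab-delimited file, checks that every field marked as required above is present as a column, and converts a 1-based closed interval into a Python slice. It uses only the standard library; the function names and the tiny in-memory example are hypothetical. Note that when rev_comp is True the coordinates refer to the reverse complement of the sequence field, which this sketch does not handle.

```python
import csv
import io

# Fields marked "required" in the table above.
REQUIRED_FIELDS = [
    "sequence_id", "sequence", "rev_comp", "productive",
    "v_call", "d_call", "j_call",
    "sequence_alignment", "germline_alignment",
    "junction", "junction_aa",
    "v_cigar", "d_cigar", "j_cigar",
]


def missing_required(fieldnames):
    """Return the required field names absent from a header."""
    return [name for name in REQUIRED_FIELDS if name not in fieldnames]


def read_rearrangements(handle):
    """Read all rows from a TSV handle after checking the header."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = missing_required(reader.fieldnames or [])
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    return list(reader)


def extract_region(sequence, start, end):
    """Slice a sequence using 1-based closed coordinates read from the TSV.

    Returns None when either coordinate is empty, since these fields
    are optional.
    """
    if start in ("", None) or end in ("", None):
        return None
    return sequence[int(start) - 1:int(end)]


# Tiny in-memory example with made-up values.
header = REQUIRED_FIELDS + ["cdr3_start", "cdr3_end"]
values = {name: "" for name in header}
values.update(sequence_id="s1", sequence="AACCGGTT", cdr3_start="3", cdr3_end="6")
text = "\t".join(header) + "\n" + "\t".join(values[n] for n in header) + "\n"

rows = read_rearrangements(io.StringIO(text))
row = rows[0]
print(extract_region(row["sequence"], row["cdr3_start"], row["cdr3_end"]))  # CCGG
```

Subtracting one from the start only, and leaving the end unchanged, is what maps a 1-based closed interval onto Python's 0-based half-open slicing.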