Rearrangement Schema
See the format overview for details on how to structure this data.
Definition Clarifications
Junction versus CDR3
We work with the IMGT definitions of the junction and CDR3 regions. Specifically, the IMGT JUNCTION includes the conserved cysteine and tryptophan/phenylalanine residues, while CDR3 excludes those two residues. Therefore, our junction and junction_aa fields, which represent the extracted sequence, include the two conserved residues, while the coordinate fields (cdr3_start and cdr3_end) exclude them.
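The relationship between the two regions can be sketched in a few lines of Python. This is only an illustration, assuming the junction string already contains both conserved codons; the function names are hypothetical and not part of the schema.

```python
def cdr3_from_junction(junction: str) -> str:
    """Drop the two conserved flanking codons (3 nt each) from a junction."""
    if len(junction) < 6:
        raise ValueError("junction is too short to hold both conserved codons")
    return junction[3:-3]


def cdr3_aa_from_junction_aa(junction_aa: str) -> str:
    """Drop the conserved first and last residues from a junction translation."""
    if len(junction_aa) < 2:
        raise ValueError("junction_aa is too short to hold both conserved residues")
    return junction_aa[1:-1]
```

For example, a junction of TGTGCGAGAGATTACTGG (amino acids CARDYW) would give a CDR3 of GCGAGAGATTAC (ARDY).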
Productive
The schema does not impose a strict definition of a productive rearrangement. However, the IMGT definition is recommended:
- Coding region has an open reading frame.
- No defect in the start codon, splicing sites or regulatory elements.
- No internal stop codons.
- An in-frame junction region.
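Two of the criteria above (no internal stop codons and an in-frame junction) can be checked from the sequence alone. The following is a rough sketch under stated assumptions: the coding sequence is supplied already in reading frame, and the helper names are hypothetical. It does not cover the start codon, splicing sites or regulatory elements, so it is not a complete productivity test.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(coding_seq: str) -> bool:
    """True if any complete in-frame codon is a stop codon.

    Assumes coding_seq starts at codon position 1 (an assumption of this sketch).
    """
    seq = coding_seq.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def junction_in_frame(junction: str) -> bool:
    """True if the junction length is a non-zero multiple of three."""
    return len(junction) > 0 and len(junction) % 3 == 0


def passes_basic_productive_checks(coding_seq: str, junction: str) -> bool:
    """Combine the two sequence-level checks; other IMGT criteria are not tested."""
    return junction_in_frame(junction) and not has_stop_codon(coding_seq)
```

A tool that reports the productive field may apply additional or different rules, so this sketch should be read as a lower bound on what a real check involves.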
Locus names
A naming convention for locus names is not strictly enforced, but the IMGT locus names are recommended. For example, in the case of human data, this would be the set: IGH, IGK, IGL, TRA, TRB, TRD, or TRG.
Gene and allele names
Gene call examples use the IMGT nomenclature, but no specific gene or allele nomenclature is mandated. Species denotations may or may not be included in the gene name, as appropriate. For example, “Homo sapiens IGHV4-59*01”, “IGHV4-59*01” and “AB019438” are all valid entries for the same allele.
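Because several spellings of the same allele are valid, a consumer may want to normalize gene calls before comparing them. The sketch below is one possible approach, not a required one: it assumes IMGT-style names, treats everything before the last whitespace-separated token as a species denotation, and returns None for anything it does not recognize (such as an accession number). The function name and the regular expression are illustrative assumptions.

```python
import re
from typing import Optional, Tuple

# Locus prefix (human IMGT set), the rest of the gene name, and an optional allele.
_IMGT_CALL = re.compile(
    r"(?P<gene>(?P<locus>IG[HKL]|TR[ABDG])[A-Z0-9/\-]+)(?:\*(?P<allele>\w+))?"
)


def parse_imgt_call(call: str) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (locus, gene, allele) for an IMGT-style call, else None."""
    tokens = call.strip().split()
    if not tokens:
        return None
    match = _IMGT_CALL.fullmatch(tokens[-1])  # ignore any species prefix
    if match is None:
        return None
    return match.group("locus"), match.group("gene"), match.group("allele")
```

With this sketch, “Homo sapiens IGHV4-59*01” and “IGHV4-59*01” both parse to the locus IGH, gene IGHV4-59 and allele 01, while “AB019438” yields None and would need a separate lookup.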
Alignments
There is no required alignment scheme for the nucleotide and amino acid alignment fields. These fields may, or may not, include numbering spacers (e.g., IMGT-numbering gaps), variations in case to denote mismatches, deletions, or other features appropriate to the tool that performed the alignment. The only strict requirement is that the query (“sequence”) and reference (“germline”) must be properly aligned.
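As an illustration of what a properly aligned query and reference allow, the sketch below compares the two strings position by position. It rests on assumptions the schema does not mandate: that "." and "-" are the gap or spacer characters, and that letter case can be ignored when comparing bases. The function name is hypothetical.

```python
GAP_CHARS = {".", "-"}  # assumed spacer/gap characters; tools may differ


def alignment_identity(sequence_alignment: str, germline_alignment: str) -> float:
    """Fraction of matching positions, skipping columns where either side is a gap."""
    if len(sequence_alignment) != len(germline_alignment):
        raise ValueError("query and germline alignments differ in length")
    matches = 0
    compared = 0
    for q, g in zip(sequence_alignment.upper(), germline_alignment.upper()):
        if q in GAP_CHARS or g in GAP_CHARS:
            continue
        compared += 1
        if q == g:
            matches += 1
    return matches / compared if compared else 0.0
```

The length check is the useful part: if the two fields differ in length, they cannot be compared column by column, whatever gap convention the tool uses.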
Fields
Name | Type | Priority | Description |
---|---|---|---|
sequence_id |
string |
required | Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. |
sequence |
string |
required | The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. |
sequence_aa |
string |
optional | Amino acid translation of the query nucleotide sequence. |
rev_comp |
boolean |
required | True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of ‘sequence’. |
productive |
boolean |
required | True if the V(D)J sequence is predicted to be productive. |
vj_in_frame |
boolean |
optional | True if the V and J segment alignments are in-frame. |
stop_codon |
boolean |
optional | True if the aligned sequence contains a stop codon. |
locus |
string |
optional | Gene locus (chain type). For example, IGH, IGK, IGL, TRA, TRB, TRD, or TRG. |
v_call |
string |
required | V gene with allele. For example, IGHV4-59*01. |
d_call |
string |
required | D gene with allele. For example, IGHD3-10*01. |
j_call |
string |
required | J gene with allele. For example, IGHJ4*02. |
c_call |
string |
optional | C region gene with allele. For example, IGHM*01. |
sequence_alignment |
string |
required | Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement. |
sequence_alignment_aa |
string |
optional | Amino acid translation of the aligned query sequence. |
germline_alignment |
string |
required | Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any). |
germline_alignment_aa |
string |
optional | Amino acid translation of the assembled germline sequence. |
junction |
string |
required | Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. |
junction_aa |
string |
required | Junction region amino acid sequence. |
np1 |
string |
optional | Nucleotide sequence of the combined N/P region between the V and D segments or V and J segments. |
np1_aa |
string |
optional | Amino acid translation of the np1 field. |
np2 |
string |
optional | Nucleotide sequence of the combined N/P region between the D and J segments. |
np2_aa |
string |
optional | Amino acid translation of the np2 field. |
cdr1 |
string |
optional | Nucleotide sequence of the aligned CDR1 region. |
cdr1_aa |
string |
optional | Amino acid translation of the cdr1 field. |
cdr2 |
string |
optional | Nucleotide sequence of the aligned CDR2 region. |
cdr2_aa |
string |
optional | Amino acid translation of the cdr2 field. |
cdr3 |
string |
optional | Nucleotide sequence of the aligned CDR3 region. |
cdr3_aa |
string |
optional | Amino acid translation of the cdr3 field. |
fwr1 |
string |
optional | Nucleotide sequence of the aligned FWR1 region. |
fwr1_aa |
string |
optional | Amino acid translation of the fwr1 field. |
fwr2 |
string |
optional | Nucleotide sequence of the aligned FWR2 region. |
fwr2_aa |
string |
optional | Amino acid translation of the fwr2 field. |
fwr3 |
string |
optional | Nucleotide sequence of the aligned FWR3 region. |
fwr3_aa |
string |
optional | Amino acid translation of the fwr3 field. |
fwr4 |
string |
optional | Nucleotide sequence of the aligned FWR4 region. |
fwr4_aa |
string |
optional | Amino acid translation of the fwr4 field. |
v_score |
number |
optional | Alignment score for the V gene alignment. |
v_identity |
number |
optional | Fractional identity for the V gene alignment. |
v_support |
number |
optional | V gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the V gene assignment as defined by the alignment tool. |
v_cigar |
string |
required | CIGAR string for the V gene alignment. |
d_score |
number |
optional | Alignment score for the D gene alignment. |
d_identity |
number |
optional | Fractional identity for the D gene alignment. |
d_support |
number |
optional | D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the D gene assignment as defined by the alignment tool. |
d_cigar |
string |
required | CIGAR string for the D gene alignment. |
j_score |
number |
optional | Alignment score for the J gene alignment. |
j_identity |
number |
optional | Fractional identity for the J gene alignment. |
j_support |
number |
optional | J gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the J gene assignment as defined by the alignment tool. |
j_cigar |
string |
required | CIGAR string for the J gene alignment. |
c_score |
number |
optional | Alignment score for the C gene alignment. |
c_identity |
number |
optional | Fractional identity for the C gene alignment. |
c_support |
number |
optional | C gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the C gene assignment as defined by the alignment tool. |
c_cigar |
string |
optional | CIGAR string for the C gene alignment. |
v_sequence_start |
integer |
optional | Start position of the V segment in the query sequence (1-based closed interval). |
v_sequence_end |
integer |
optional | End position of the V segment in the query sequence (1-based closed interval). |
v_germline_start |
integer |
optional | Alignment start position in the V gene reference sequence (1-based closed interval). |
v_germline_end |
integer |
optional | Alignment end position in the V gene reference sequence (1-based closed interval). |
v_alignment_start |
integer |
optional | Start position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval). |
v_alignment_end |
integer |
optional | End position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval). |
d_sequence_start |
integer |
optional | Start position of the D segment in the query sequence (1-based closed interval). |
d_sequence_end |
integer |
optional | End position of the D segment in the query sequence (1-based closed interval). |
d_germline_start |
integer |
optional | Alignment start position in the D gene reference sequence (1-based closed interval). |
d_germline_end |
integer |
optional | Alignment end position in the D gene reference sequence (1-based closed interval). |
d_alignment_start |
integer |
optional | Start position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval). |
d_alignment_end |
integer |
optional | End position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval). |
j_sequence_start |
integer |
optional | Start position of the J segment in the query sequence (1-based closed interval). |
j_sequence_end |
integer |
optional | End position of the J segment in the query sequence (1-based closed interval). |
j_germline_start |
integer |
optional | Alignment start position in the J gene reference sequence (1-based closed interval). |
j_germline_end |
integer |
optional | Alignment end position in the J gene reference sequence (1-based closed interval). |
j_alignment_start |
integer |
optional | Start position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval). |
j_alignment_end |
integer |
optional | End position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval). |
cdr1_start |
integer |
optional | CDR1 start position in the query sequence (1-based closed interval). |
cdr1_end |
integer |
optional | CDR1 end position in the query sequence (1-based closed interval). |
cdr2_start |
integer |
optional | CDR2 start position in the query sequence (1-based closed interval). |
cdr2_end |
integer |
optional | CDR2 end position in the query sequence (1-based closed interval). |
cdr3_start |
integer |
optional | CDR3 start position in the query sequence (1-based closed interval). |
cdr3_end |
integer |
optional | CDR3 end position in the query sequence (1-based closed interval). |
fwr1_start |
integer |
optional | FWR1 start position in the query sequence (1-based closed interval). |
fwr1_end |
integer |
optional | FWR1 end position in the query sequence (1-based closed interval). |
fwr2_start |
integer |
optional | FWR2 start position in the query sequence (1-based closed interval). |
fwr2_end |
integer |
optional | FWR2 end position in the query sequence (1-based closed interval). |
fwr3_start |
integer |
optional | FWR3 start position in the query sequence (1-based closed interval). |
fwr3_end |
integer |
optional | FWR3 end position in the query sequence (1-based closed interval). |
fwr4_start |
integer |
optional | FWR4 start position in the query sequence (1-based closed interval). |
fwr4_end |
integer |
optional | FWR4 end position in the query sequence (1-based closed interval). |
v_sequence_alignment |
string |
optional | Aligned portion of query sequence assigned to the V segment, including any indel corrections or numbering spacers. |
v_sequence_alignment_aa |
string |
optional | Amino acid translation of the v_sequence_alignment field. |
d_sequence_alignment |
string |
optional | Aligned portion of query sequence assigned to the D segment, including any indel corrections or numbering spacers. |
d_sequence_alignment_aa |
string |
optional | Amino acid translation of the d_sequence_alignment field. |
j_sequence_alignment |
string |
optional | Aligned portion of query sequence assigned to the J segment, including any indel corrections or numbering spacers. |
j_sequence_alignment_aa |
string |
optional | Amino acid translation of the j_sequence_alignment field. |
c_sequence_alignment |
string |
optional | Aligned portion of query sequence assigned to the constant region, including any indel corrections or numbering spacers. |
c_sequence_alignment_aa |
string |
optional | Amino acid translation of the c_sequence_alignment field. |
v_germline_alignment |
string |
optional | Aligned V gene germline sequence spanning the same region as the v_sequence_alignment field and including the same set of corrections and spacers (if any). |
v_germline_alignment_aa |
string |
optional | Amino acid translation of the v_germline_alignment field. |
d_germline_alignment |
string |
optional | Aligned D gene germline sequence spanning the same region as the d_sequence_alignment field and including the same set of corrections and spacers (if any). |
d_germline_alignment_aa |
string |
optional | Amino acid translation of the d_germline_alignment field. |
j_germline_alignment |
string |
optional | Aligned J gene germline sequence spanning the same region as the j_sequence_alignment field and including the same set of corrections and spacers (if any). |
j_germline_alignment_aa |
string |
optional | Amino acid translation of the j_germline_alignment field. |
c_germline_alignment |
string |
optional | Aligned constant region germline sequence spanning the same region as the c_sequence_alignment field and including the same set of corrections and spacers (if any). |
c_germline_alignment_aa |
string |
optional | Amino acid translation of the c_germline_alignment field. |
junction_length |
integer |
optional | Number of nucleotides in the junction sequence. |
np1_length |
integer |
optional | Number of nucleotides between the V and D segments or V and J segments. |
np2_length |
integer |
optional | Number of nucleotides between the D and J segments. |
n1_length |
integer |
optional | Number of untemplated nucleotides 5’ of the D segment. |
n2_length |
integer |
optional | Number of untemplated nucleotides 3’ of the D segment. |
p3v_length |
integer |
optional | Number of palindromic nucleotides 3’ of the V segment. |
p5d_length |
integer |
optional | Number of palindromic nucleotides 5’ of the D segment. |
p3d_length |
integer |
optional | Number of palindromic nucleotides 3’ of the D segment. |
p5j_length |
integer |
optional | Number of palindromic nucleotides 5’ of the J segment. |
consensus_count |
integer |
optional | Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence. |
duplicate_count |
integer |
optional | Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs. |
cell_id |
string |
optional | Identifier defining the cell of origin for the query sequence. |
clone_id |
string |
optional | Clonal cluster assignment for the query sequence. |
rearrangement_id |
string |
optional | Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications. |
rearrangement_set_id |
string |
optional | Identifier for grouping Rearrangement objects. |
germline_database |
string |
optional | Source of germline V(D)J genes with version number or date accessed. For example, ‘IMGT/GENE-DB 3.1.18 (15 March 2018)’. |
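The coordinate fields in the table use 1-based closed intervals, which differ from the 0-based half-open slices of most programming languages. The sketch below shows one way a consumer might handle this, along with a presence check for the fields marked required. It assumes a row has already been parsed into a dict with rev_comp as a boolean and coordinates convertible to int; all function names are hypothetical, and the required-field list is transcribed from the table above.

```python
REQUIRED_FIELDS = (
    "sequence_id", "sequence", "rev_comp", "productive",
    "v_call", "d_call", "j_call",
    "sequence_alignment", "germline_alignment",
    "junction", "junction_aa",
    "v_cigar", "d_cigar", "j_cigar",
)

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def missing_required_fields(row: dict) -> list:
    """List required field names that are absent from the row (presence only)."""
    return [name for name in REQUIRED_FIELDS if name not in row]


def reverse_complement(seq: str) -> str:
    """Reverse complement of a nucleotide string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def slice_closed(seq: str, start: int, end: int) -> str:
    """Extract a 1-based closed interval [start, end] from seq."""
    if start < 1 or end < start or end > len(seq):
        raise ValueError(f"invalid 1-based closed interval: {start}..{end}")
    return seq[start - 1:end]  # convert to a 0-based half-open slice


def extract_region(row: dict, region: str) -> str:
    """Extract e.g. region='cdr3' using the <region>_start/<region>_end fields.

    When rev_comp is True, coordinates refer to the reverse complement
    of the sequence field, as described for rev_comp in the table.
    """
    seq = row["sequence"]
    if row.get("rev_comp"):
        seq = reverse_complement(seq)
    return slice_closed(seq, int(row[f"{region}_start"]), int(row[f"{region}_end"]))
```

For instance, with a sequence of AAACCCGG, cdr3_start of 3 and cdr3_end of 6, the extracted region is ACCC when rev_comp is False and GGGT when rev_comp is True.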