Rearrangement Schema

See the format overview for details on how to structure this data.

Definition Clarifications

Junction versus CDR3

We use the IMGT definitions of the junction and CDR3 regions. Specifically, the IMGT JUNCTION includes the conserved cysteine and tryptophan/phenylalanine residues, while the CDR3 excludes those two residues. Therefore, the junction and junction_aa fields, which hold the extracted sequence, include the two conserved residues, while the coordinate fields (cdr3_start and cdr3_end) exclude them.
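The relationship between the two definitions can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of the schema; the sketch assumes the junction is complete, so that its first and last codons are the conserved residues.

```python
def cdr3_from_junction(junction: str) -> str:
    """Drop the first and last codon of a junction nucleotide sequence.

    Assumes the junction starts with the conserved cysteine codon and
    ends with the conserved tryptophan/phenylalanine codon.
    """
    if len(junction) < 6:
        raise ValueError("junction is too short to hold both conserved codons")
    return junction[3:-3]


def cdr3_aa_from_junction_aa(junction_aa: str) -> str:
    """Drop the first and last residue of a junction amino acid sequence."""
    if len(junction_aa) < 2:
        raise ValueError("junction_aa is too short to hold both conserved residues")
    return junction_aa[1:-1]


# Illustrative (made-up) junction: C A R D Y W
print(cdr3_from_junction("TGTGCGAGAGACTACTGG"))  # GCGAGAGACTAC
print(cdr3_aa_from_junction_aa("CARDYW"))        # ARDY
```

The same offsets apply to coordinates: under these assumptions, cdr3_start falls three nucleotides after the first junction nucleotide and cdr3_end three nucleotides before the last.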

Productive

The schema does not impose a strict definition of a productive rearrangement. However, the IMGT definition is recommended:

  1. Coding region has an open reading frame.
  2. No defect in the start codon, splicing sites or regulatory elements.
  3. No internal stop codons.
  4. An in-frame junction region.
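
Two of the criteria above (no internal stop codons and an in-frame junction) can be approximated directly from sequence data. The following Python sketch is only a rough heuristic, not the IMGT procedure: the function names are hypothetical, the coding sequence is assumed to be ungapped and to begin at the first position of a codon, and an in-frame junction is approximated as a junction whose length is a multiple of three. The start codon, splicing sites and regulatory elements are not checked.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(coding_seq: str) -> bool:
    """Return True if any complete codon in frame 1 is a stop codon."""
    seq = coding_seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return any(codon in STOP_CODONS for codon in codons)


def junction_in_frame(junction: str) -> bool:
    """Approximate the in-frame test by junction length (assumption)."""
    return len(junction) > 0 and len(junction) % 3 == 0


def looks_productive(coding_seq: str, junction: str) -> bool:
    """Heuristic covering only criteria 3 and 4 of the list above."""
    return junction_in_frame(junction) and not has_stop_codon(coding_seq)


# Illustrative (made-up) sequences.
print(looks_productive("ATGGCCTGTGCGAGAGACTACTGG", "TGTGCGAGAGACTACTGG"))  # True
print(looks_productive("ATGTAATGTGCGAGAGACTACTGG", "TGTGCGAGAGACTACTGG"))  # False
```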

Locus names

A naming convention for locus names is not strictly enforced, but the IMGT locus names are recommended. For example, in the case of human data, this would be the set: IGH, IGK, IGL, TRA, TRB, TRD, or TRG.

Gene and allele names

Gene call examples use the IMGT nomenclature, but no specific gene or allele nomenclature is mandated. Species denotations may or may not be included in the gene name, as appropriate. For example, “Homo sapiens IGHV4-59*01”, “IGHV4-59*01” and “AB019438” are all valid entries for the same allele.

Alignments

There is no required alignment scheme for the nucleotide and amino acid alignment fields. These fields may, or may not, include numbering spacers (e.g., IMGT-numbering gaps), variations in case to denote mismatches, deletions, or other features appropriate to the tool that performed the alignment. The only strict requirement is that the query (“sequence”) and reference (“germline”) must be properly aligned.
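
Because the only strict requirement is that query and germline are aligned to each other, a consumer can at least verify that the two strings have equal length before comparing them position by position. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, "." and "-" are treated as gap or spacer characters, N is treated as uninformative, and case is ignored (which would discard any case-based mismatch annotation a tool may have added).

```python
GAP_CHARS = frozenset(".-")  # assumed spacer/gap characters


def check_aligned(sequence_alignment: str, germline_alignment: str) -> None:
    """Raise ValueError if the two alignment strings differ in length."""
    if len(sequence_alignment) != len(germline_alignment):
        raise ValueError("sequence_alignment and germline_alignment differ in length")


def count_mismatches(sequence_alignment: str, germline_alignment: str) -> int:
    """Count positions where query and germline disagree.

    Positions with a gap character or an N in either string are skipped.
    """
    check_aligned(sequence_alignment, germline_alignment)
    mismatches = 0
    for q, g in zip(sequence_alignment.upper(), germline_alignment.upper()):
        if q in GAP_CHARS or g in GAP_CHARS or q == "N" or g == "N":
            continue
        if q != g:
            mismatches += 1
    return mismatches


# Illustrative (made-up) alignment with a three-position spacer.
print(count_mismatches("CAGGTG...CAGCTA", "CAGGTG...CAGCTG"))  # 1
```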

Fields


Name Type Priority Description
sequence_id string required Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment.
sequence string required The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment.
sequence_aa string optional Amino acid translation of the query nucleotide sequence.
rev_comp boolean required True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of ‘sequence’.
productive boolean required True if the V(D)J sequence is predicted to be productive.
vj_in_frame boolean optional True if the V and J segment alignments are in-frame.
stop_codon boolean optional True if the aligned sequence contains a stop codon.
locus string optional Gene locus (chain type). For example, IGH, IGI, IGK, IGL, TRA, TRB, TRD, or TRG.
v_call string required V gene with allele. For example, IGHV4-59*01.
d_call string required D gene with allele. For example, IGHD3-10*01.
j_call string required J gene with allele. For example, IGHJ4*02.
c_call string optional C region gene with allele. For example, IGHM*01.
sequence_alignment string required Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement.
sequence_alignment_aa string optional Amino acid translation of the aligned query sequence.
germline_alignment string required Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any).
germline_alignment_aa string optional Amino acid translation of the assembled germline sequence.
junction string required Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons.
junction_aa string required Junction region amino acid sequence.
np1 string optional Nucleotide sequence of the combined N/P region between the V and D segments or V and J segments.
np1_aa string optional Amino acid translation of the np1 field.
np2 string optional Nucleotide sequence of the combined N/P region between the D and J segments.
np2_aa string optional Amino acid translation of the np2 field.
cdr1 string optional Nucleotide sequence of the aligned CDR1 region.
cdr1_aa string optional Amino acid translation of the cdr1 field.
cdr2 string optional Nucleotide sequence of the aligned CDR2 region.
cdr2_aa string optional Amino acid translation of the cdr2 field.
cdr3 string optional Nucleotide sequence of the aligned CDR3 region.
cdr3_aa string optional Amino acid translation of the cdr3 field.
fwr1 string optional Nucleotide sequence of the aligned FWR1 region.
fwr1_aa string optional Amino acid translation of the fwr1 field.
fwr2 string optional Nucleotide sequence of the aligned FWR2 region.
fwr2_aa string optional Amino acid translation of the fwr2 field.
fwr3 string optional Nucleotide sequence of the aligned FWR3 region.
fwr3_aa string optional Amino acid translation of the fwr3 field.
fwr4 string optional Nucleotide sequence of the aligned FWR4 region.
fwr4_aa string optional Amino acid translation of the fwr4 field.
v_score number optional Alignment score for the V gene alignment.
v_identity number optional Fractional identity for the V gene alignment.
v_support number optional V gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the V gene assignment as defined by the alignment tool.
v_cigar string required CIGAR string for the V gene alignment.
d_score number optional Alignment score for the D gene alignment.
d_identity number optional Fractional identity for the D gene alignment.
d_support number optional D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the D gene assignment as defined by the alignment tool.
d_cigar string required CIGAR string for the D gene alignment.
j_score number optional Alignment score for the J gene alignment.
j_identity number optional Fractional identity for the J gene alignment.
j_support number optional J gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the J gene assignment as defined by the alignment tool.
j_cigar string required CIGAR string for the J gene alignment.
c_score number optional Alignment score for the C gene alignment.
c_identity number optional Fractional identity for the C gene alignment.
c_support number optional C gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the C gene assignment as defined by the alignment tool.
c_cigar string optional CIGAR string for the C gene alignment.
v_sequence_start integer optional Start position of the V segment in the query sequence (1-based closed interval).
v_sequence_end integer optional End position of the V segment in the query sequence (1-based closed interval).
v_germline_start integer optional Alignment start position in the V gene reference sequence (1-based closed interval).
v_germline_end integer optional Alignment end position in the V gene reference sequence (1-based closed interval).
v_alignment_start integer optional Start position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
v_alignment_end integer optional End position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
d_sequence_start integer optional Start position of the D segment in the query sequence (1-based closed interval).
d_sequence_end integer optional End position of the D segment in the query sequence (1-based closed interval).
d_germline_start integer optional Alignment start position in the D gene reference sequence (1-based closed interval).
d_germline_end integer optional Alignment end position in the D gene reference sequence (1-based closed interval).
d_alignment_start integer optional Start position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
d_alignment_end integer optional End position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
j_sequence_start integer optional Start position of the J segment in the query sequence (1-based closed interval).
j_sequence_end integer optional End position of the J segment in the query sequence (1-based closed interval).
j_germline_start integer optional Alignment start position in the J gene reference sequence (1-based closed interval).
j_germline_end integer optional Alignment end position in the J gene reference sequence (1-based closed interval).
j_alignment_start integer optional Start position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
j_alignment_end integer optional End position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
cdr1_start integer optional CDR1 start position in the query sequence (1-based closed interval).
cdr1_end integer optional CDR1 end position in the query sequence (1-based closed interval).
cdr2_start integer optional CDR2 start position in the query sequence (1-based closed interval).
cdr2_end integer optional CDR2 end position in the query sequence (1-based closed interval).
cdr3_start integer optional CDR3 start position in the query sequence (1-based closed interval).
cdr3_end integer optional CDR3 end position in the query sequence (1-based closed interval).
fwr1_start integer optional FWR1 start position in the query sequence (1-based closed interval).
fwr1_end integer optional FWR1 end position in the query sequence (1-based closed interval).
fwr2_start integer optional FWR2 start position in the query sequence (1-based closed interval).
fwr2_end integer optional FWR2 end position in the query sequence (1-based closed interval).
fwr3_start integer optional FWR3 start position in the query sequence (1-based closed interval).
fwr3_end integer optional FWR3 end position in the query sequence (1-based closed interval).
fwr4_start integer optional FWR4 start position in the query sequence (1-based closed interval).
fwr4_end integer optional FWR4 end position in the query sequence (1-based closed interval).
v_sequence_alignment string optional Aligned portion of query sequence assigned to the V segment, including any indel corrections or numbering spacers.
v_sequence_alignment_aa string optional Amino acid translation of the v_sequence_alignment field.
d_sequence_alignment string optional Aligned portion of query sequence assigned to the D segment, including any indel corrections or numbering spacers.
d_sequence_alignment_aa string optional Amino acid translation of the d_sequence_alignment field.
j_sequence_alignment string optional Aligned portion of query sequence assigned to the J segment, including any indel corrections or numbering spacers.
j_sequence_alignment_aa string optional Amino acid translation of the j_sequence_alignment field.
c_sequence_alignment string optional Aligned portion of query sequence assigned to the constant region, including any indel corrections or numbering spacers.
c_sequence_alignment_aa string optional Amino acid translation of the c_sequence_alignment field.
v_germline_alignment string optional Aligned V gene germline sequence spanning the same region as the v_sequence_alignment field and including the same set of corrections and spacers (if any).
v_germline_alignment_aa string optional Amino acid translation of the v_germline_alignment field.
d_germline_alignment string optional Aligned D gene germline sequence spanning the same region as the d_sequence_alignment field and including the same set of corrections and spacers (if any).
d_germline_alignment_aa string optional Amino acid translation of the d_germline_alignment field.
j_germline_alignment string optional Aligned J gene germline sequence spanning the same region as the j_sequence_alignment field and including the same set of corrections and spacers (if any).
j_germline_alignment_aa string optional Amino acid translation of the j_germline_alignment field.
c_germline_alignment string optional Aligned constant region germline sequence spanning the same region as the c_sequence_alignment field and including the same set of corrections and spacers (if any).
c_germline_alignment_aa string optional Amino acid translation of the c_germline_alignment field.
junction_length integer optional Number of nucleotides in the junction sequence.
np1_length integer optional Number of nucleotides between the V and D segments or V and J segments.
np2_length integer optional Number of nucleotides between the D and J segments.
n1_length integer optional Number of untemplated nucleotides 5’ of the D segment.
n2_length integer optional Number of untemplated nucleotides 3’ of the D segment.
p3v_length integer optional Number of palindromic nucleotides 3’ of the V segment.
p5d_length integer optional Number of palindromic nucleotides 5’ of the D segment.
p3d_length integer optional Number of palindromic nucleotides 3’ of the D segment.
p5j_length integer optional Number of palindromic nucleotides 5’ of the J segment.
consensus_count integer optional Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence.
duplicate_count integer optional Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs.
pair_id string optional Valid sequence_id that was determined by experimental or computational means to be associated with the current Rearrangement on the cellular level.
cell_id string optional Identifier defining the cell of origin for the query sequence.
clone_id string optional Clonal cluster assignment for the query sequence.
rearrangement_id string optional Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications.
repertoire_id string optional Identifier to the associated repertoire in study metadata.
data_processing_id string optional Identifier to the data processing object in the repertoire metadata for this rearrangement. If this field is empty then the primary data processing object is assumed.
germline_database string optional Source of germline V(D)J genes with version number or date accessed.
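
As an illustration of how the field table might be used, the following Python sketch checks that every field marked as required is present as a column, and converts a 1-based closed interval into a Python slice. It assumes a tab-delimited file with a header row of field names; the function names are hypothetical and this is not the reference implementation.

```python
import csv
import io

# Fields marked "required" in the table above.
REQUIRED_FIELDS = [
    "sequence_id", "sequence", "rev_comp", "productive",
    "v_call", "d_call", "j_call",
    "sequence_alignment", "germline_alignment",
    "junction", "junction_aa",
    "v_cigar", "d_cigar", "j_cigar",
]


def missing_required(fieldnames):
    """Return the required field names absent from a header."""
    present = set(fieldnames)
    return [name for name in REQUIRED_FIELDS if name not in present]


def read_rearrangements(handle):
    """Read tab-delimited records, failing if a required column is missing."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = missing_required(reader.fieldnames or [])
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    return list(reader)


def slice_closed(sequence: str, start: int, end: int) -> str:
    """Extract positions start..end (1-based, both ends inclusive)."""
    return sequence[start - 1:end]


# A 1-based closed interval [3, 8] corresponds to the Python slice [2:8].
print(slice_closed("AACAGGTGCAGCTGTT", 3, 8))  # CAGGTG

# A header lacking most required columns is rejected.
try:
    read_rearrangements(io.StringIO("sequence_id\tsequence\nseq1\tACGT\n"))
except ValueError as err:
    print(err)
```

With records loaded this way, the V segment of a query could be recovered as slice_closed(record["sequence"], int(record["v_sequence_start"]), int(record["v_sequence_end"])), provided those optional fields are populated.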