Germline Schema (Experimental)

Motivation

Understanding and cataloguing receptor germline genes and allele sequences is critical to the analysis of AIRR data. While the human set is relatively well understood in outline, although probably still far from complete, knowledge of the sets of other species, even those that are relatively closely studied, is at a much earlier stage. There is an urgent need to define a standardised format for listing such genes, so that they can be shared between researchers and easily consumed by software tools.

Receptor Germline Schema

The receptor germline schema defines the data elements necessary to describe one or more receptor germline genes, together with supporting evidence. The fundamental object is the GeneDescription, which describes a single gene or allele. It contains the details needed to annotate a rearranged sequence, such as the location of CDRs (in the case of a V-gene) and framing information (in the case of a J-gene). GeneDescription also contains fields to delineate the recombination signal sequences (RSS) and the leader regions of V-genes, should those be covered by the sequence provided.

Evidence supporting the gene or allele can be provided in linked GermlineSequence and RearrangedSequence objects. Information represented in these objects will typically be stored in a repository: either an INSDC repository such as GenBank or SRA, or a lower-tier repository such as OGRDB. Please note that the key distinction between these object types is whether the V(D)J genes have rearranged, rather than the origin of the material, as mature B and T cells carry rearranged sequences in chromosomal DNA. It is most likely that supporting sequences will be GermlineSequences, i.e. prior to rearrangement. In the case of a germline inference from a repertoire, the inferred germline sequence should be provided as a GermlineSequence, if the evidence has been deposited in a repository.

For V-genes, an IMGT-gapped sequence (i.e., a sequence delineated in accordance with the IMGT numbering scheme) is provided in GeneDescription. Other delineations, such as Chothia and Kabat, can be provided via linked GeneDelineationV objects. A GermlineSet brings together multiple GeneDescriptions from the same locus to form a curated set. The schema assumes that germline sets will be published by multiple repositories. A germline set may be uniquely referenced by means of the germline_set_ref: this is a composite field containing the repository id, germline set label, and version.
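As an illustration of the composite germline_set_ref, the following sketch splits a reference into its three parts. It assumes the colon-separated form Repo:Label:Version given in the GermlineSet field definitions, and that none of the parts contains a colon. The GermlineSetRef class, the parse_germline_set_ref function, and the example value are hypothetical and not part of the schema.

```python
from typing import NamedTuple


class GermlineSetRef(NamedTuple):
    """Hypothetical container for the three parts of a germline_set_ref."""
    repository: str
    label: str
    version: str


def parse_germline_set_ref(ref: str) -> GermlineSetRef:
    """Split a 'Repo:Label:Version' string into its components.

    Assumes that no component contains a colon.
    """
    parts = ref.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Repo:Label:Version, got {ref!r}")
    return GermlineSetRef(*parts)


# Made-up reference, for illustration only.
ref = parse_germline_set_ref("ExampleRepo:Human_IGH:7")
print(ref.repository, ref.label, ref.version)
```

The version is kept as a string here so that the reference can be rebuilt exactly with ":".join(ref).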

Gene and Allele Naming

The International Union of Immunological Societies allocates gene symbols for receptor genes. GeneDescription contains a gene_symbol field, but it is optional, recognising that a symbol may not have been assigned. Gene symbols are long-lasting, but the underlying sequence may be revised over time. GeneDescription contains a mandatory coding_sequence_identifier, which will be updated should the sequence change. It is anticipated that publishers of gene sets will provide mechanisms to issue these identifiers, and to allow researchers to review the change history of GeneDescriptions and GermlineSets. In the interests of consistency and transparency, when referring to a gene or allele, the gene_symbol should be used wherever possible; coding_sequence_identifier provides a fallback where a gene symbol has not been assigned.
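The naming rule above can be sketched as a small helper. This is only an illustration: it assumes a GeneDescription has been loaded into a Python dict keyed by field name, and the display_name function and the identifier value shown are hypothetical.

```python
from typing import Optional


def display_name(gene_description: dict) -> str:
    """Prefer gene_symbol; fall back to coding_sequence_identifier.

    gene_symbol is optional and nullable, so it may be absent or None.
    coding_sequence_identifier is mandatory, so a KeyError here means
    the record itself is incomplete.
    """
    symbol: Optional[str] = gene_description.get("gene_symbol")
    if symbol:
        return symbol
    return gene_description["coding_sequence_identifier"]


# Illustrative records; the identifier value is made up.
named = {"gene_symbol": "IGHV1-69*01", "coding_sequence_identifier": "EXAMPLE-ID-0001"}
unnamed = {"gene_symbol": None, "coding_sequence_identifier": "EXAMPLE-ID-0002"}

print(display_name(named))
print(display_name(unnamed))
```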

Genotypes

A ReceptorGenotype describes the specific alleles found in an individual, and also identifies genes that are not found (deleted). Depending on the data available and the inference method used, genotypes may contain haplotyping information, which may be full or partial. As an example of partial haplotyping, the genotype may have been determined from genomic sequencing in which the sequence of the locus was assembled into contigs, but could not be fully assembled. In this case the co-location of alleles in each contig has been established, but the co-location across the entire locus cannot be. Co-location is therefore indicated by means of the phasing parameter, which in this case would be assigned a different value for alleles on each contig.
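To make the partial-haplotyping case concrete, the sketch below groups alleles by their phasing value, so that alleles sharing a value are treated as co-located on one contig. Only the phasing parameter is named in the text above; the record layout (a list of dicts with label and phasing keys), the group_by_phasing function, and the allele labels are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_phasing(alleles: List[dict]) -> Dict[int, List[str]]:
    """Group allele labels by phasing value.

    Alleles sharing a phasing value are taken to be co-located.
    Nothing is implied about co-location between different values.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for record in alleles:
        groups[record["phasing"]].append(record["label"])
    return dict(groups)


# Hypothetical genotype assembled into two contigs.
alleles = [
    {"label": "allele_A", "phasing": 1},
    {"label": "allele_B", "phasing": 1},
    {"label": "allele_C", "phasing": 2},
]

for phasing, labels in group_by_phasing(alleles).items():
    print(phasing, labels)
```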

File Format Specification

The file format has not been specified yet.

GermlineSet Fields

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| germline_set_id | string | required | Unique identifier of the GermlineSet within this file |
| author | string | required | Corresponding author |
| lab_name | string | required | Department of corresponding author |
| lab_address | string | required | Institutional address of corresponding author |
| acknowledgements | array of Acknowledgement | optional, nullable | List of individuals whose contribution to the germline set should be acknowledged |
| release_version | number | required | Version number of this record, allocated automatically |
| release_description | string | required | Brief descriptive notes of the reason for this release and the changes embodied |
| release_date | string | required | Date of this release |
| germline_set_name | string | required | Descriptive name of this germline set |
| germline_set_ref | string | required | Unique identifier of the germline set and version, in standardized form (Repo:Label:Version) |
| pub_ids | string | optional, nullable | Publications describing the germline set |
| species | string | required | Binomial designation of subject’s species |
| species_subgroup | string | optional, nullable | Race, strain or other species subgroup to which this subject belongs |
| species_subgroup_type | string | optional, nullable | |
| locus | string | required | Gene locus |
| gene_descriptions | array of GeneDescription | required | List of gene_descriptions in the germline set |
| notes | string | optional, nullable | Notes |
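A consumer of a GermlineSet might begin by checking that every field marked required above is present. The sketch below assumes the set has already been loaded into a Python dict keyed by field name (the file format itself is not yet specified), and it treats a None value as missing because none of the required fields is marked nullable. The missing_required_fields function and the example record are hypothetical.

```python
from typing import List

# Fields marked "required" in the GermlineSet field list.
REQUIRED_GERMLINE_SET_FIELDS = (
    "germline_set_id",
    "author",
    "lab_name",
    "lab_address",
    "release_version",
    "release_description",
    "release_date",
    "germline_set_name",
    "germline_set_ref",
    "species",
    "locus",
    "gene_descriptions",
)


def missing_required_fields(germline_set: dict) -> List[str]:
    """Return the required field names that are absent or None."""
    return [
        name
        for name in REQUIRED_GERMLINE_SET_FIELDS
        if germline_set.get(name) is None
    ]


# Deliberately incomplete, made-up record.
example = {"germline_set_id": "set-1", "author": "A. Researcher", "locus": "IGH"}
print(missing_required_fields(example))
```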

GeneDescription Fields

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| gene_description_id | string | required | Unique identifier of this GeneDescription within the file |
| maintainer | string | required | Maintainer of this sequence record |
| acknowledgements | array of Acknowledgement | optional, nullable | List of individuals whose contribution to the gene description should be acknowledged |
| lab_address | string | required | Institution and full address of corresponding author |
| release_version | integer | required | Version number of this record, updated whenever a revised version is published or released |
| release_date | string | required | Date of this release |
| release_description | string | required | Brief descriptive notes of the reason for this release and the changes embodied |
| gene_symbol | string | optional, nullable | The accepted name for this gene or allele, if any |
| sequence | string | required | Nucleotide sequence of the gene. This should cover the full length that is available, including where possible RSS, and 5’ UTR and lead-in for V-gene sequences |
| coding_sequence | string | required | Nucleotide sequence of the core region of the gene (V-, D-, J- or C-REGION), aligned, in the case of the V-REGION, with the IMGT numbering scheme |
| coding_sequence_identifier | string | required | Unique identifier of the coding_sequence, as allocated by an identified repository |
| alt_names | array of string | optional, nullable | Alternative names for this sequence |
| locus | string | required | Gene locus |
| chromosome | integer | optional, nullable | Chromosome on which the gene is located |
| sequence_type | string | required | Sequence type (V, D, J, C) |
| functional | boolean | required | True if the gene is functional, false if it is a pseudogene |
| inference_type | string | required | Type of inference(s) from which this gene sequence was inferred |
| species | string | required | Binomial designation of subject’s species |
| species_subgroup | string | optional, nullable | Race, strain or other species subgroup to which this subject belongs |
| species_subgroup_type | string | optional, nullable | |
| status | string | optional, nullable | Status of record, assumed active if the field is not present |
| gene_subgroup | string | optional, nullable | Gene subgroup or clade, as (and if) identified for this species and gene |
| subgroup_designation | string | optional, nullable | Gene designation within this subgroup, if identified |
| allele_designation | string | optional, nullable | Allele designation, if identified |
| j_codon_frame | integer | optional, nullable | Codon position of the first nucleotide in the ‘coding_sequence’ field. Mandatory for J genes. Not used for V or D genes. (‘1’ means the sequence is in-frame, ‘2’ means that the first bp is missing from the first codon, ‘3’ means that the first 2 bp are missing) |
| gene_start | integer | optional, nullable | Co-ordinate (in the sequence field) of the first nucleotide in the coding_sequence field |
| gene_end | integer | optional, nullable | Co-ordinate (in the sequence field) of the last gene-coding nucleotide in the coding_sequence field |
| utr_5_prime_start | integer | optional, nullable | Start co-ordinate (in the sequence field) of 5 prime UTR (V-genes only) |
| utr_5_prime_end | integer | optional, nullable | End co-ordinate (in the sequence field) of 5 prime UTR (V-genes only) |
| leader_1_start | integer | optional, nullable | Start co-ordinate (in the sequence field) of L-PART1 (V-genes only) |
| leader_1_end | integer | optional, nullable | End co-ordinate (in the sequence field) of L-PART1 (V-genes only) |
| leader_2_start | integer | optional, nullable | Start co-ordinate (in the sequence field) of L-PART2 (V-genes only) |
| leader_2_end | integer | optional, nullable | End co-ordinate (in the sequence field) of L-PART2 (V-genes only) |
| v_rs_start | integer | optional, nullable | Start co-ordinate (in the sequence field) of V recombination site (V-genes only) |
| v_rs_end | integer | optional, nullable | End co-ordinate (in the sequence field) of V recombination site (V-genes only) |
| d_rs_3_prime_start | integer | optional, nullable | Start co-ordinate (in the sequence field) of 3 prime D recombination site (D-genes only) |
| d_rs_3_prime_end | integer | optional, nullable | End co-ordinate (in the sequence field) of 3 prime D recombination site (D-genes only) |
| d_rs_5_prime_start | integer | optional, nullable | Start co-ordinate (in the sequence field) of 5 prime D recombination site (D-genes only) |
| d_rs_5_prime_end | integer | optional, nullable | End co-ordinate (in the sequence field) of 5 prime D recombination site (D-genes only) |
| j_cdr3_end | integer | optional, nullable | In the case of a J-gene, the co-ordinate (in the sequence field) of the first nucleotide of the conserved PHE or TRP (IMGT codon position 118) |
| j_rs_start | integer | optional, nullable | Start co-ordinate (in the sequence field) of J recombination site (J-genes only) |
| j_rs_end | integer | optional, nullable | End co-ordinate (in the sequence field) of J recombination site (J-genes only) |
| j_donor_splice | integer | optional, nullable | Co-ordinate (in the sequence field) of the 3’ splice donor site (J-genes only) |
| v_gene_delineations | array of GeneDelineationV | optional, nullable | |
| genomic_support | array of GermlineSequence | optional, nullable | |
| rearranged_support | array of RearrangedSequence | optional, nullable | |
| paralogs | array of string | optional, nullable | Gene symbols of any paralogs |
| notes | string | optional, nullable | Notes |
| curational_tags | array of string | optional, nullable | Controlled-vocabulary tags applied to this description |
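The j_codon_frame definition above can be turned into a short worked example. Under that definition, a value of 1 means the J coding_sequence starts on a codon boundary, 2 means two nucleotides of a partial codon precede the first complete codon, and 3 means one does. The sketch below trims a J sequence to whole codons on that basis; the j_in_frame function and the toy sequences are hypothetical and not part of the schema.

```python
def j_in_frame(coding_sequence: str, j_codon_frame: int) -> str:
    """Trim a J coding_sequence so it starts and ends on codon boundaries.

    j_codon_frame is the codon position (1, 2 or 3) of the first
    nucleotide, so the number of leading nucleotides to skip is:
        frame 1 -> 0, frame 2 -> 2, frame 3 -> 1
    """
    if j_codon_frame not in (1, 2, 3):
        raise ValueError("j_codon_frame must be 1, 2 or 3")
    skip = (4 - j_codon_frame) % 3
    trimmed = coding_sequence[skip:]
    # Drop any incomplete codon left at the 3' end.
    return trimmed[: len(trimmed) - len(trimmed) % 3]


# Toy sequences, not real J genes.
print(j_in_frame("TGGTTC", 1))
print(j_in_frame("ACTGGTTCG", 2))
print(j_in_frame("CTGGTTC", 3))
```

All three toy inputs reduce to the same two codons, which shows how the frame value absorbs a partial leading codon.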

RearrangedSequence Fields

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| sequence_id | string | required | Unique identifier of this RearrangedSequence within the file |
| sequence | string | required | Nucleotide sequence |
| derivation | string | required | The class of nucleic acid that was used as primary starting material |
| observation_type | string | required | The type of observation from which this sequence was drawn, e.g. direct sequencing, inference from repertoire |
| notes | string | optional, nullable | Notes |
| repository_name | string | required | Name of the repository in which the sequence has been deposited |
| repository_id | string | required | Id or serial number of the sequence within the repository |
| deposited_version | string | required | Version number of the sequence within the repository |
| seq_start | integer | required | Start co-ordinate of the sequence detailed in this record, within the sequence deposited |
| seq_end | integer | required | End co-ordinate of the sequence detailed in this record, within the sequence deposited |

GermlineSequence Fields

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| sequence_id | string | required | Unique identifier of this GermlineSequence within the file |
| sequence | string | required | Sequence of interest described in this record (typically this will include gene and promoter region) |
| notes | string | optional, nullable | Notes |
| repository_name | string | required | Name of the repository in which the assembly or contig is deposited |
| assembly_id | string | required | Identifier of the assembly or contig within the repository |
| patch_no | string | optional, nullable | Genome assembly patch number in which this gene was determined |
| gff_seqid | string | required, nullable | Germline sequence (from the assembly) of a window including the gene and preferably also the promoter region |
| gff_start | integer | required, nullable | Genomic co-ordinates of the start of the sequence of interest described in this record, in Ensembl GFF version 3 |
| gff_end | integer | required, nullable | Genomic co-ordinates of the end of the sequence of interest described in this record, in Ensembl GFF version 3 |
| strand | string | required, nullable | Sense (+ or -) |
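Because gff_start and gff_end are given as GFF version 3 co-ordinates, which are 1-based and inclusive, they need a small adjustment before they can be used as a Python slice. The sketch below shows one way to recover the sequence of interest from a contig string. It assumes that the whole contig is available in memory and that a sequence on the - strand should be reverse-complemented. The extract_gff_region function and the toy contig are hypothetical.

```python
# Complement table for unambiguous DNA bases plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_gff_region(contig: str, gff_start: int, gff_end: int, strand: str) -> str:
    """Return the region [gff_start, gff_end] of contig (1-based, inclusive).

    For strand '-', the reverse complement is returned (an assumption
    of this sketch about how the caller wants the sequence presented).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 1 <= gff_start <= gff_end <= len(contig):
        raise ValueError("co-ordinates fall outside the contig")
    region = contig[gff_start - 1 : gff_end]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region


# Toy contig, for illustration only.
contig = "AACCGTTAGC"
print(extract_gff_region(contig, 3, 6, "+"))
print(extract_gff_region(contig, 3, 6, "-"))
```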

GeneDelineationV Fields

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| germline_delineation_id | string | required | Unique identifier of this GeneDelineationV within the file |
| delineation_scheme | string | required | Name of the delineation scheme |
| fwr1_start | integer | required | FWR1 start co-ordinate in Gene Description ‘alignment’ field |
| fwr1_end | integer | required | FWR1 end co-ordinate in Gene Description ‘alignment’ field |
| cdr1_start | integer | required | CDR1 start co-ordinate in Gene Description ‘alignment’ field |
| cdr1_end | integer | required | CDR1 end co-ordinate in Gene Description ‘alignment’ field |
| fwr2_start | integer | required | FWR2 start co-ordinate in Gene Description ‘alignment’ field |
| fwr2_end | integer | required | FWR2 end co-ordinate in Gene Description ‘alignment’ field |
| cdr2_start | integer | required | CDR2 start co-ordinate in Gene Description ‘alignment’ field |
| cdr2_end | integer | required | CDR2 end co-ordinate in Gene Description ‘alignment’ field |
| fwr3_start | integer | required | FWR3 start co-ordinate in Gene Description ‘alignment’ field |
| fwr3_end | integer | required | FWR3 end co-ordinate in Gene Description ‘alignment’ field |
| cdr3_start | integer | required | CDR3 start co-ordinate in Gene Description ‘alignment’ field |
| alignment | array of string | optional, nullable | One string for each codon in the fields v_start to cdr3_start indicating the label of that codon according to the numbering of the delineation scheme |
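The start and end fields above can be used to cut a V-gene sequence into its framework and CDR regions. The sketch below is illustrative only. It assumes the co-ordinates are 1-based and inclusive and that they index a plain sequence string, neither of which is stated in the field definitions. CDR3 has only a start co-ordinate, so it is taken to run to the end of the sequence. The split_regions function, the toy sequence, and the co-ordinate values are hypothetical.

```python
from typing import Dict

# Regions that have both a start and an end field.
BOUNDED_REGIONS = ("fwr1", "cdr1", "fwr2", "cdr2", "fwr3")


def split_regions(sequence: str, delineation: dict) -> Dict[str, str]:
    """Slice sequence into regions using GeneDelineationV-style fields.

    Assumes 1-based, inclusive co-ordinates (an assumption of this sketch).
    """
    regions = {}
    for name in BOUNDED_REGIONS:
        start = delineation[f"{name}_start"]
        end = delineation[f"{name}_end"]
        regions[name] = sequence[start - 1 : end]
    # Only the start of CDR3 is recorded, so take everything from there on.
    regions["cdr3"] = sequence[delineation["cdr3_start"] - 1 :]
    return regions


# Toy sequence with an obvious region structure, not a real V gene.
sequence = "AAAAAA" + "CCC" + "GGGGGG" + "TTT" + "AAAAAA" + "CC"
delineation = {
    "fwr1_start": 1, "fwr1_end": 6,
    "cdr1_start": 7, "cdr1_end": 9,
    "fwr2_start": 10, "fwr2_end": 15,
    "cdr2_start": 16, "cdr2_end": 18,
    "fwr3_start": 19, "fwr3_end": 24,
    "cdr3_start": 25,
}

for name, region in split_regions(sequence, delineation).items():
    print(name, region)
```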

ReceptorGenotype Fields

MHCGenotype Fields

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| mhc_genotype_id | string | required | A unique identifier for this MHC Genotype, assumed to be unique in the context of the study. |
| genotype_class | string | optional, nullable | |
| germline_alleles | array of object | optional, nullable | Array of gene descriptions |
| genotype_process | string | optional, nullable | Information on how the genotype was acquired. Controlled vocabulary. |