Germline Schema (Experimental)#
Motivation#
Understanding and cataloguing receptor germline genes and allele sequences is critical to the analysis of AIRR data. While the human set is relatively well understood in outline, although probably still far from complete, the sets of other species, even those that are relatively closely studied, are at a much earlier stage. There is an urgent need to define a standardised format for listing such genes, so that they can be shared between researchers and easily consumed by software tools.
Receptor Germline Schema#
The receptor germline schema defines the data elements necessary to describe one or more receptor germline genes, together
with supporting evidence. The fundamental object is the AlleleDescription, which describes a single gene or allele, containing
the necessary details for the annotation of a rearranged sequence, such as the location of CDRs (in the case of a V-gene) and
framing information (in the case of a J-gene). AlleleDescription also contains fields to delineate the RSS and the leader regions
of V-genes, should those be covered by the sequence provided.
Evidence supporting the gene or allele can be provided in linked UnrearrangedSequence and RearrangedSequence objects. Information
represented in these objects will typically be stored in a repository: either an INSDC repository such as GenBank or SRA, or
a lower-tier repository such as OGRDB. Please note that the key distinction between these object types is whether the V(D)J
genes have rearranged, rather than the origin of the material, as mature B and T cells carry rearranged sequences in chromosomal
DNA. It is most likely that supporting sequences will be UnrearrangedSequences, i.e. prior to rearrangement. In the case of a
germline inference from a repertoire, the inferred germline sequence should be provided as a RearrangedSequence, if the evidence
has been deposited in a repository.
For V-genes, an IMGT-gapped sequence (i.e., a sequence delineated in accordance with the
IMGT numbering scheme) is provided in AlleleDescription. Other delineations, such as Chothia and
Kabat, can be provided via linked SequenceDelineationV objects.
A GermlineSet brings together multiple AlleleDescriptions from the same locus to form a curated set. The schema assumes that germline
sets will be published by multiple repositories. A germline set may be uniquely referenced by means of the germline_set_ref, which
is a composite field containing the repository id, germline set label, and version.
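As a minimal sketch of how a tool might split such a composite reference, the following assumes the colon-separated Repo:Label:Version form given in the field definition, and further assumes that neither the repository id nor the version contains a colon. The names GermlineSetRef and parse_germline_set_ref and the example value are hypothetical, not part of the schema.

```python
from typing import NamedTuple


class GermlineSetRef(NamedTuple):
    repository: str
    label: str
    version: str


def parse_germline_set_ref(ref: str) -> GermlineSetRef:
    """Split a Repo:Label:Version reference into its three parts.

    Assumption: the repository id is the text before the first colon and
    the version is the text after the last colon; the label is whatever
    lies between them.
    """
    repository, sep1, rest = ref.partition(":")
    label, sep2, version = rest.rpartition(":")
    if not (sep1 and sep2 and repository and label and version):
        raise ValueError(f"not in Repo:Label:Version form: {ref!r}")
    return GermlineSetRef(repository, label, version)


# Hypothetical reference value, for illustration only.
example = parse_germline_set_ref("ExampleRepo:Human_IGH:3")
```

The version is kept as a string here, since the composite field itself is defined as a string.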
Gene and Allele Naming#
AlleleDescription contains a label field, which should contain the accepted name for the gene or allele, as determined by the authors/curators
of the record. The Nomenclature Committee of the International Union of Immunological Societies (IUIS) allocates gene symbols for receptor genes, and, if a gene symbol has been
allocated, this should be used as the label. Where a gene symbol has not been allocated (for example, because the gene or allele has only
recently been discovered, or because the available evidence does not meet IUIS standards), a ‘temporary label’ should be used. It is anticipated
that publishers of gene sets will provide mechanisms to issue these temporary labels, and to allow researchers to review the change history of
AlleleDescriptions and GermlineSets. To provide consistency across research groups, the
Germline Database Working Group of the AIRR Community is
developing a community-wide approach to the allocation of temporary labels.
Genotypes#
A GenotypeSet describes the specific receptor alleles found in a subject, and also identifies genes that are not found (this could be either
because they are not present in the chromosomal locus, or because they are not expressed or expressed only at low levels).
Depending on the data available and the inference method used, genotypes may contain haplotyping information, which may be full or partial.
As an example of partial haplotyping, the genotype may have been determined from genomic sequencing in which the sequence of the locus was
assembled into contigs, but could not be fully assembled. In this case the co-location of alleles in each contig has been established, but
the co-location across the entire locus cannot be. Co-location is therefore indicated by means of the phasing parameter, which in this
case would be assigned a different value for alleles on each contig.
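To illustrate how the phasing parameter might be consumed, here is a small sketch that groups alleles by their phasing value, so that alleles sharing a value are treated as co-located. The record layout (plain dictionaries with label and phasing keys), the integer phasing values and the allele names are assumptions made for this example only.

```python
from collections import defaultdict


def group_by_phasing(alleles):
    """Group allele labels by their phasing value.

    Alleles that share a phasing value are taken to be co-located (for
    example, on the same contig). Alleles with no phasing value are
    collected under the key None.
    """
    groups = defaultdict(list)
    for allele in alleles:
        groups[allele.get("phasing")].append(allele["label"])
    return dict(groups)


# Hypothetical genotype fragment: two contigs, hence two phasing values.
alleles = [
    {"label": "IGHV1-69*01", "phasing": 0},
    {"label": "IGHV3-23*01", "phasing": 0},
    {"label": "IGHV4-34*01", "phasing": 1},
]
co_located = group_by_phasing(alleles)
```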
MHC Genotypes#
Similarly to the IG/TR genotypes, the MHCGenotype and MHCGenotypeSet
objects describe the MHC alleles found in a subject. MHCGenotype objects
assemble alleles from one class: MHC-I, MHC-II or MHC-nonclassical.
The method used to determine the genotype can be provided in the
mhc_genotyping_method field. As different methods might be used for the
various classes, this field is located in the MHCGenotype object, not the
MHCGenotypeSet.
The mhc_genotyping_method field allows free-text descriptions; however, data
curators are asked to keep close to the following terms if applicable:
PCR-based typing: Methods whose read-out is the amplification of specific sequences, but which do not provide sequence data by themselves. This includes SSP and SSOP.
Sequencing-based typing: Clinical-grade NGS-based assays, providing high quality and resolution.
Inference-based typing: Allele inference based on genome-wide DNA or RNA sequencing.
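Because the field is free text, a curation tool might want to check whether a supplied value matches one of the three recommended terms. The sketch below does so while ignoring case and surrounding whitespace; that tolerance, and the names used, are assumptions of this example rather than anything the schema prescribes.

```python
RECOMMENDED_MHC_GENOTYPING_METHODS = (
    "PCR-based typing",
    "Sequencing-based typing",
    "Inference-based typing",
)


def match_recommended_method(value):
    """Return the recommended term that `value` matches, or None.

    Matching ignores case and leading/trailing whitespace. A return value
    of None means the text is free-form, which the field still allows.
    """
    normalised = value.strip().casefold()
    for term in RECOMMENDED_MHC_GENOTYPING_METHODS:
        if term.casefold() == normalised:
            return term
    return None
```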
File Format Specification#
Files are YAML/JSON with a structure defined below. Files should be
encoded as UTF-8. Identifiers are case-sensitive. Files should have the
extension .yaml, .yml, or .json.
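A reader for such files could choose its parser from the file extension. The following is a sketch only: the function name load_airr_file is hypothetical, the extension is matched exactly as written above, and YAML parsing relies on the third-party PyYAML package, which is imported only when a YAML file is actually read.

```python
import json
from pathlib import Path


def load_airr_file(path):
    """Load a YAML or JSON file, choosing the parser from the extension."""
    path = Path(path)
    # Files are expected to be UTF-8 encoded.
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        return json.loads(text)
    if path.suffix in (".yaml", ".yml"):
        import yaml  # PyYAML (third-party), needed only for YAML input

        return yaml.safe_load(text)
    raise ValueError(f"unsupported file extension: {path.suffix!r}")
```

yaml.safe_load is used rather than yaml.load so that the file is parsed as plain data.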
Germline Set File Structure#
The Germline Set file has a standardised structure that is utilized by all top-level AIRR Schema Objects and defined by
the DataFile schema. It is intended to contain all information necessary to annotate receptor sequences derived from a single germline
locus, and to be directly usable by annotation tools and other processing software.
The file must contain a YAML or JSON representation of one or more GermlineSet objects, including the associated AlleleDescription
objects. It may optionally include other associated objects: SequenceDelineationV, RearrangedSequence, UnrearrangedSequence,
Acknowledgement. These should all be embedded into the overall GermlineSet as specified in the schema.
The file as a whole is considered a dictionary (key/value pair) structure with the keys Info, GermlineSet, and AlleleDescription.
The GermlineSet contains fields release_version, release_description and release_date, which are intended to be used for version identification, under the control of the authors of the GermlineSet as identified by the fields author, lab_name and lab_address. If the set is modified by a party other than these authors, these six fields should be modified to reflect the authors of the modification and their own version identification. These modifications MUST be made if the GermlineSet is, or is likely to become, public, in order to avoid confusion with the original set prior to modification. Repositories are encouraged to manage version fields automatically.
The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI specification. If provided, version in Info should reference the version of the AIRR schema for the file.
The file should contain a list of GermlineSet objects, using GermlineSet as the key to the list.
The file should contain a list of AlleleDescription objects, using AlleleDescription as the key to the list.
There should be only one AlleleDescription for each allele in the list.
Each AlleleDescription object should contain a top-level key/value pair for allele_description_id that uniquely identifies the allele description object in the file.
Each GermlineSet object should contain a top-level key/value pair for germline_set_id that uniquely identifies the germline set object in the file.
Some fields require the use of a particular ontology or controlled vocabulary.
GermlineSet and AlleleDescription contain reference fields germline_set_ref and allele_description_ref. These are intended to be globally unique references (containing identifiers of the repository, object and version) that can be used in a query API.
The structure is the same regardless of whether the data is stored in a file or retrieved from a data repository. For example, the ADC API will return a properly structured JSON object that can be saved to a file and used directly without modification.
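The uniqueness requirements above lend themselves to a simple check. The sketch below takes an already-parsed file (a dictionary with the top-level keys GermlineSet and AlleleDescription, as described) and reports any identifier that occurs more than once. The function name and the example identifiers are hypothetical, and this is not a substitute for validation against the schema itself.

```python
from collections import Counter

# Top-level list key -> identifier field that must be unique within the file.
ID_FIELDS = {
    "GermlineSet": "germline_set_id",
    "AlleleDescription": "allele_description_id",
}


def find_duplicate_ids(data_file):
    """Return identifiers that appear more than once, per top-level key."""
    duplicates = {}
    for key, id_field in ID_FIELDS.items():
        objects = data_file.get(key) or []
        counts = Counter(obj.get(id_field) for obj in objects)
        repeated = sorted(
            ident for ident, n in counts.items() if n > 1 and ident is not None
        )
        if repeated:
            duplicates[key] = repeated
    return duplicates


# Hypothetical parsed file with one repeated allele description identifier.
example_file = {
    "GermlineSet": [{"germline_set_id": "GS1"}],
    "AlleleDescription": [
        {"allele_description_id": "A1"},
        {"allele_description_id": "A2"},
        {"allele_description_id": "A1"},
    ],
}
problems = find_duplicate_ids(example_file)
```

An empty result means no duplicated identifiers were found under either key.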
GermlineSet Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | Unique identifier of the GermlineSet within this file. Typically, generated by the repository hosting the record. |
| | array of Contributor | required, nullable | List of individuals whose contribution to the germline set should be acknowledged. Note that these are not necessarily identical with the authors on an associated manuscript or other scholarly communication. Further note that typically at least the three CRediT contributor roles “supervision”, “investigation” and “data curation” should be assigned. The corresponding author should be listed last. |
| | number | required, nullable | Version number of this record, allocated automatically |
| | string | required, nullable | Brief descriptive notes of the reason for this release and the changes embodied |
| | string | required, nullable | Date of this release |
| | string | required, nullable | Descriptive name of this germline set |
| | string | required, nullable | Unique identifier of the germline set and version, in standardized form (Repo:Label:Version) |
| | array of string | optional, nullable | Publications describing the germline set |
| | | required | Binomial designation of subject’s species |
| | string | optional, nullable | Race, strain or other species subgroup to which this subject belongs |
| | string | optional, nullable | |
| | string | required | Gene locus |
| | array of AlleleDescription | required, nullable | List of allele_descriptions in the germline set |
| | string | optional, nullable | Curational notes on the GermlineSet. This can be used to give more extensive notes on the decisions taken than are provided in the release_description. |
AlleleDescription Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | Unique identifier of this AlleleDescription within the file. Typically, generated by the repository hosting the record. |
| | string | optional, nullable | Unique reference to the allele description, in standardized form (Repo:Label:Version) |
| | array of Contributor | required, nullable | List of individuals whose contribution to the gene description should be acknowledged. Note that these are not necessarily identical with the authors on an associated manuscript or other scholarly communication. Further note that typically at least the three CRediT contributor roles “supervision”, “investigation” and “data curation” should be assigned. The current maintainer should be listed first. |
| | integer | required, nullable | Version number of this record, updated whenever a revised version is published or released |
| | string | required, nullable | Date of this release |
| | string | required, nullable | Brief descriptive notes of the reason for this release and the changes embodied |
| | string | optional, nullable | The accepted name for this gene or allele following the relevant nomenclature. The value in this field should correspond to values in acceptable name fields of other schemas, such as v_call, d_call, and j_call fields. |
| | string | required | Nucleotide sequence of the gene. This should cover the full length that is available, including where possible RSS, and 5’ UTR and lead-in for V-gene sequences. |
| | string | required, nullable | Nucleotide sequence of the core coding region, such as the coding region of a D-, J- or C- gene or the coding region of a V-gene excluding the leader. |
| | array of string | optional, nullable | Alternative names for this sequence |
| | string | required | Gene locus |
| | integer | optional, nullable | Chromosome on which the gene is located |
| | string | required | Sequence type (V, D, J, C) |
| | boolean | required, nullable | True if the gene is functional, false if it is a pseudogene |
| | string | required, nullable | Type of inference(s) from which this gene sequence was inferred |
| | | required | Binomial designation of subject’s species |
| | string | optional, nullable | Race, strain or other species subgroup to which this subject belongs |
| | string | optional, nullable | |
| | string | optional, nullable | Status of record, assumed active if the field is not present |
| | string | optional, nullable | Identifier of the gene subgroup or clade, as (and if) defined |
| | string | optional, nullable | Gene number or other identifier, as (and if) defined |
| | string | optional, nullable | Allele number or other identifier, as (and if) defined |
| | string | optional, nullable | ID of the similarity cluster used in this germline set, if designated |
| | string | optional, nullable | Membership ID of the allele within the similarity cluster, if a cluster is designated |
| | integer | optional, nullable | Codon position of the first nucleotide in the ‘coding_sequence’ field. Mandatory for J genes. Not used for V or D genes. ‘1’ means the sequence is in-frame, ‘2’ means that the first bp is missing from the first codon, and ‘3’ means that the first 2 bp are missing. |
| | integer | optional, nullable | Co-ordinate in the sequence field of the first nucleotide in the coding_sequence field. |
| | integer | optional, nullable | Co-ordinate in the sequence field of the last gene-coding nucleotide in the coding_sequence field. |
| | integer | optional, nullable | Start co-ordinate in the sequence field of the 5 prime UTR (V-genes only). |
| | integer | optional, nullable | End co-ordinate in the sequence field of the 5 prime UTR (V-genes only). |
| | integer | optional, nullable | Start co-ordinate in the sequence field of L-PART1 (V-genes only). |
| | integer | optional, nullable | End co-ordinate in the sequence field of L-PART1 (V-genes only). |
| | integer | optional, nullable | Start co-ordinate in the sequence field of L-PART2 (V-genes only). |
| | integer | optional, nullable | End co-ordinate in the sequence field of L-PART2 (V-genes only). |
| | integer | optional, nullable | Start co-ordinate in the sequence field of the V recombination site (V-genes only). |
| | integer | optional, nullable | End co-ordinate in the sequence field of the V recombination site (V-genes only). |
| | integer | optional, nullable | Start co-ordinate in the sequence field of the 3 prime D recombination site (D-genes only). |
| | integer | optional, nullable | End co-ordinate in the sequence field of the 3 prime D recombination site (D-genes only). |
| | integer | optional, nullable | Start co-ordinate in the sequence field of the 5 prime D recombination site (D-genes only). |
| | integer | optional, nullable | End co-ordinate in the sequence field of the 5 prime D recombination site (D-genes only). |
| | integer | optional, nullable | In the case of a J-gene, the co-ordinate in the sequence field of the first nucleotide of the conserved PHE or TRP (IMGT codon position 118). |
| | integer | optional, nullable | Start co-ordinate in the sequence field of J recombination site (J-genes only). |
| | integer | optional, nullable | End co-ordinate in the sequence field of J recombination site (J-genes only). |
| | integer | optional, nullable | Co-ordinate in the sequence field of the final 3’ nucleotide of the J-REGION (J-genes only). |
| | array of SequenceDelineationV | optional, nullable | |
| | array of UnrearrangedSequence | optional, nullable | |
| | array of RearrangedSequence | optional, nullable | |
| | array of string | optional, nullable | Gene symbols of any paralogs |
| | string | optional, nullable | Curational notes on the AlleleDescription. This can be used to give more extensive notes on the decisions taken than are provided in the release_description. |
| | array of string | optional, nullable | Controlled-vocabulary tags applied to this description |
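The codon position value defined above for J genes (1, 2 or 3) tells a tool how many leading nucleotides of the coding sequence belong to an incomplete codon. As a sketch of that arithmetic, with hypothetical function names and a made-up example sequence:

```python
def leading_partial_length(codon_frame):
    """Number of leading nucleotides that belong to an incomplete codon.

    1 -> 0 (sequence starts at the first position of a codon)
    2 -> 2 (first bp of the codon is missing, two remain)
    3 -> 1 (first 2 bp of the codon are missing, one remains)
    """
    if codon_frame not in (1, 2, 3):
        raise ValueError(f"codon frame must be 1, 2 or 3, got {codon_frame!r}")
    return (4 - codon_frame) % 3


def complete_codons(coding_sequence, codon_frame):
    """Split a coding sequence into complete codons, honouring the frame."""
    body = coding_sequence[leading_partial_length(codon_frame):]
    usable = len(body) - len(body) % 3  # drop any trailing partial codon
    return [body[i:i + 3] for i in range(0, usable, 3)]


# Made-up sequence for illustration; frame 2 means two leading bases are skipped.
codons = complete_codons("ACTGGTTCGAC", 2)
```

Trailing nucleotides that do not fill a codon are dropped here; a real annotation tool may need to treat them differently.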
RearrangedSequence Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | Unique identifier of this RearrangedSequence within the file, typically generated by the repository hosting the schema, for example from the underlying ID of the database record. |
| | string | required | Nucleotide sequence |
| | string | required, nullable | The class of nucleic acid that was used as primary starting material |
| | string | required | The type of observation from which this sequence was drawn, such as direct sequencing or inference from repertoire sequencing data. |
| | string | optional, nullable | Curational notes on the sequence |
| | string | required, nullable | Name of the repository in which the sequence has been deposited |
| | string | optional, nullable | Queryable id or accession number of the sequence published by the repository |
| | string | required, nullable | Version number of the sequence within the repository |
| | integer | optional | Start co-ordinate of the sequence detailed in this record, within the sequence deposited |
| | integer | optional | End co-ordinate of the sequence detailed in this record, within the sequence deposited |
UnrearrangedSequence Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | Unique identifier of this UnrearrangedSequence within the file |
| | string | required | Sequence of interest described in this record. Typically, this will include gene and promoter region. |
| | string | optional, nullable | Curational notes on the sequence |
| | string | required, nullable | Name of the repository in which the assembly or contig is deposited |
| | string | optional, nullable | Queryable id or accession number of the sequence published by the repository |
| | string | optional, nullable | Genome assembly patch number in which this gene was determined |
| | string | required, nullable | Sequence (from the assembly) of a window including the gene and preferably also the promoter region. |
| | integer | required, nullable | Genomic co-ordinates of the start of the sequence of interest described in this record in Ensembl GFF version 3. |
| | integer | required, nullable | Genomic co-ordinates of the end of the sequence of interest described in this record in Ensembl GFF version 3. |
| | string | required, nullable | Sense (+ or -) |
SequenceDelineationV Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | Unique identifier of this SequenceDelineationV within the file. Typically, generated by the repository hosting the record. |
| | string | required, nullable | Name of the delineation scheme |
| | string | optional, nullable | Entire V-sequence covered by this delineation |
| | string | optional, nullable | Aligned sequence if this delineation provides an alignment. An aligned sequence should always be provided for IMGT delineations. |
| | integer | required, nullable | FWR1 start co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | FWR1 end co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | CDR1 start co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | CDR1 end co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | FWR2 start co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | FWR2 end co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | CDR2 start co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | CDR2 end co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | FWR3 start co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | FWR3 end co-ordinate in the ‘unaligned sequence’ field |
| | integer | required, nullable | CDR3 start co-ordinate in the ‘unaligned sequence’ field |
| | array of string | optional, nullable | One string for each codon in the aligned_sequence indicating the label of that codon according to the numbering of the delineation scheme if it provides one. |
GenotypeSet Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | A unique identifier for this Receptor Genotype Set, typically generated by the repository hosting the schema, for example from the underlying ID of the database record. |
| | array of Genotype | optional, nullable | List of Genotypes included in this Receptor Genotype Set. |
Genotype Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | A unique identifier within the file for this Receptor Genotype, typically generated by the repository hosting the schema, for example from the underlying ID of the database record. |
| | string | required | Gene locus |
| | array of DocumentedAllele | optional, nullable | List of alleles documented in reference set(s) |
| | array of UndocumentedAllele | optional, nullable | List of alleles inferred to be present and not documented in an identified GermlineSet |
| | array of DeletedGene | optional, nullable | Array of genes identified as being deleted in this genotype |
| | string | optional, nullable | Information on how the genotype was acquired. Controlled vocabulary. |
MHCGenotypeSet Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | A unique identifier for this MHCGenotypeSet |
| | array of MHCGenotype | required, nullable | List of MHCGenotypes included in this set |
MHCGenotype Fields#
| Name | Type | Attributes | Definition |
|---|---|---|---|
| | string | required, identifier, nullable | A unique identifier for this MHCGenotype, assumed to be unique in the context of the study |
| | string | required | Class of MHC alleles described by the MHCGenotype |
| | array of MHCAllele | required, nullable | List of MHC alleles of the indicated mhc_class identified in an individual |
| | string | optional, nullable | Information on how the genotype was determined. The content of this field should come from a list of recommended terms provided in the AIRR Schema documentation. |