MiAIRR-to-NCBI Specification#
Outline of INSDC reporting procedure#
TODO: Outline the reporting procedure for data sets 1-4
In terms of standard compliance it is currently REQUIRED [1] to deposit information for MiAIRR data sets 5 and 6 in general-purpose sequence repositories for which an AIRR-accepted specification on information mapping MUST exist. However, users should note that in the future additional AIRR-sanctioned mechanisms for data deposition will become available as specified by the AIRR Common Repository Working Group. The mapping of data items in MiAIRR data sets 5 and 6 differs substantially in size and structure and therefore requires distinct reporting procedures:
Set 5: This is free text information describing the work flow, tools and parameters of the sequence read processing. It is REQUIRED that this information is deposited as a freely available document, permanently linked via a DOI. Note that there is currently neither a specific format for this document nor a mandatory service provider for obtaining the DOI.
Set 6: This is specified to contain the consensus sequence and the following information obtained from the initial analysis: V, D and J segment, C region and IMGT-JUNCTION [2] [LIGMDB_V12]. These will be deposited in a general-purpose INSDC repository, using the record structure described below.
INSDC records were originally designed to hold individual Sanger sequences. Therefore each record will contain a header with information largely identical between all records in an AIRR sequencing study. Records can be concatenated for uploading.
The INSDC feature table (FT) [INSDC_FT] is a sequence annotation standard used within the INSDC records and assigns information to specified positions on the reported sequence string. In regard to the correct location of the provided annotation, it should especially be noted that some V(D)J inference tools will return coordinates referring to the reference instead of the query sequence. As the sequence submitted in a record MUST be identical to the query sequence, the positions provided by the V(D)J inference tool MUST, if necessary, be translated back onto the query sequence. In case the start and/or end of a feature cannot be reliably determined or is not present in the reported sequence [3], open intervals CAN be used for reporting. However, open intervals MUST NOT be used to deliberately obfuscate known positions.
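The open-interval notation mentioned above can be illustrated with a minimal sketch. The helper name `format_location` and its parameters are hypothetical; it only assumes 1-based, inclusive coordinates on the query sequence and the `<`/`>` markers that the feature table uses for open ends.

```python
def format_location(start, end, start_open=False, end_open=False):
    """Format 1-based inclusive query coordinates as an INSDC location string.

    start_open / end_open mark a boundary that cannot be reliably determined
    or lies outside the reported sequence.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    left = f"<{start}" if start_open else str(start)
    right = f">{end}" if end_open else str(end)
    return f"{left}..{right}"


print(format_location(1, 257))                 # closed interval
print(format_location(258, 290, True, True))   # both ends open
```

Coordinates returned relative to a germline reference would have to be mapped back onto the query sequence before being passed to such a helper; that mapping depends on the inference tool and is not shown here.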
In addition to the required information specified in Table_1, users CAN use all valid FT keys/qualifiers to provide further annotation for the reported sequences. However, a record MUST still be compliant with this specification if such OPTIONAL information were removed, meaning that it is FORBIDDEN to move REQUIRED information into OPTIONAL keys/qualifiers. In addition, users MUST NOT use keys/qualifiers that could create ambiguity with the keys/qualifiers specified here.
| element | FT key | FT qualifier | FT value | REQUIRED (if used by original study) |
|---|---|---|---|---|
| V segment | | | see [Feature table] | yes |
| D segment | | | see [Feature table] | yes; if IGH, TRB or TRD sequence |
| J segment | | | see [Feature table] | yes |
| C region | | | see [Feature table] | yes |
| JUNCTION | | | “JUNCTION” | yes |
Table 1: Summary of the mapping of mandatory AIRR MiniStd data set 6 elements to the INSDC feature table (FT). Note that the overall record will contain additional information, such as cross-references linking the deposited sequence reads and metadata.
Element mapping#
The broad strategy of element mapping to the various repositories is depicted in Table_2.
| MiAIRR data set / subset | target repository |
|---|---|
| 1 / study | BioProject |
| 1 / subject | |
| 1 / diagnosis & treatment | |
| 2 / sample | BioSample |
| 3 / processing (cells) | |
| 3 / processing (nucleic acids) | SRA |
| 4 / raw sequences | |
| 5 / processing (data) | user-defined DOI |
| 6 / Processed sequences & annotations | GenBank |
Table 2: Summary of the mapping of MiAIRR data sets to the various repositories
Mapping of data sets 1-4 to BioProject/BioSample/SRA#
TODO: Include item-by-item mapping [NCBI_NBK47528]
Mapping of data set 5 to a user-defined repository#
While several mandatory items have been defined in this data set, there is currently no mapping, as the reporting procedure is implemented as a free text document. AIRR RECOMMENDS to use Zenodo for deposition of these documents, as it is hosted by CERN and supports versioned DOIs (termed “concept” DOI). Users SHOULD use the existing AIRR tag when submitting documents to increase the visibility of their study.
Mapping of data set 6 to INSDC#
Users should note that while the FT is standardized, the overall sequence record structure diverges between the three INSDC repositories. The following section refers to items at or above the hierarchy level of the FT using the GenBank specification [GENBANK_FF]; the corresponding designations of ENA [ENA_MANUAL] are provided in parentheses [11].
Record header#
The header MUST contain all of the following elements:
REQUIRED: header structure as specified by the respective INSDC repository [ENA_MANUAL] [GENBANK_FF] [GENBANK_SR].
FORBIDDEN: the DEFINITION entry, which will be autopopulated from information provided in the FT part (misc_feature, /note).
REQUIRED: identifier of the associated SRA record (MiAIRR data set 4) as DBLINK (ENA: DR line). Note that it is not possible to refer to individual raw reads; only the full SRA collections can be linked.
REQUIRED: in the KEYWORDS field (ENA: KW line):
  the term “TLS”
  the term “Targeted Locus Study”
  the term “AIRR”
  the term “MiAIRR:<x>.<y>” with <x> and <y> indicating the used version and subversion of the MiAIRR standard.
REQUIRED: DOI of the associated free-text record containing the information on data processing (MiAIRR data set 5) as REMARK within a REFERENCE [4] (ENA: RX line).
OPTIONAL: The use of structured comments is currently being evaluated for use in future versions of the MiAIRR standard.
Feature table#
The feature table, indicated by FEATURES (ENA: FH line), MUST or SHOULD contain the following keys/qualifiers:
General sequence information#
REQUIRED: key source containing the following qualifiers:
  REQUIRED: qualifier /organism (required by [INSDC_FT]).
  REQUIRED: qualifier /mol_type (required by [INSDC_FT]).
  REQUIRED: qualifier /citation pointing to the reference in the header (REFERENCE, ENA: RN line) that links to the data set 5 document.
  REQUIRED: qualifier /rearranged [5].
  REQUIRED: qualifier /note containing the AIRR_READ_COUNT keyword to indicate the read number used for the consensus. The criteria for selecting these reads and the procedure used to build the consensus SHOULD be reported as part of data set 5.
  OPTIONAL: qualifier /note containing the AIRR_INDEX_CELL keyword for single-cell experiments. The value of the keyword SHOULD only contain alpha-numeric characters and MUST be identical for sequences derived from the same cell of origin.
  RECOMMENDED: qualifiers /assembly_gap and /linkage_evidence to annotate non-overlapping paired-end sequences.
  RECOMMENDED: qualifier /strain, if /organism is “Mus musculus”.
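The two AIRR-specific /note keywords can be sketched as follows. The `AIRR_READ_COUNT:<n>` form is taken from the example record in the appendix; using the same `KEYWORD:value` form for AIRR_INDEX_CELL is an assumption, as are the function names. The alpha-numeric restriction is only a SHOULD in the text, so rejecting other characters here is a deliberately strict choice.

```python
def read_count_note(count):
    """Build the /note qualifier carrying the consensus read count."""
    if count < 1:
        raise ValueError("read count must be a positive integer")
    return f'/note="AIRR_READ_COUNT:{count}"'


def index_cell_note(cell_id):
    """Build the /note qualifier carrying the cell-of-origin identifier.

    Assumption: the value follows the keyword after a colon.
    """
    if not (cell_id.isascii() and cell_id.isalnum()):
        raise ValueError("cell identifier should be alpha-numeric")
    return f'/note="AIRR_INDEX_CELL:{cell_id}"'


print(read_count_note(123))
print(index_cell_note("cell0042"))  # hypothetical identifier
```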
Note that additional qualifiers might be REQUIRED by GenBank to harmonize the GenBank record with the BioSample referenced by it in the header. A list of known BioSample keywords and GenBank qualifiers that MUST contain the same information can be found below. Whether (and in which direction) the existence of a keyword/qualifier triggers a requirement in the corresponding record is currently unknown. Please report any undocumented requirements surfacing during submission to the MiAIRR team.
| BioSample keyword | GenBank FT qualifier |
|---|---|
| | |
| | |
| | |
| | |
Segment and region annotation#
The following keys MUST be used for annotation according to their FT definition, if the respective item has been reported by the original study:
REQUIRED: key V_region. Note that this key MUST NOT be used to annotate V segment leader sequence [6] [7].
REQUIRED: key misc_feature with coordinates identical to those given in V_region. This key MUST contain a /note qualifier whose value is a string describing the general type of variable region described by the record. The string MUST match the regular expression
  /^(immunoglobulin (heavy|light)|T cell receptor (alpha|beta|gamma|delta)) chain variable region$/
  This string will be used as record heading upon import into GenBank. Note that while this behavior of GenBank is undocumented, the procedure has been approved by NCBI.
REQUIRED: key V_segment; both coordinates MUST be within V_region. Note that this key MUST NOT be used to annotate V segment leader sequence [6] [7].
REQUIRED: key D_segment; both coordinates MUST be within V_region. This key is only REQUIRED for sequences of applicable loci (IGH, TRB, TRD [8]). In the rare case of rearrangements using two D segments, this key SHOULD occur twice, but the coordinates of both keys MUST NOT overlap.
REQUIRED: key J_segment; both coordinates MUST be within V_region.
REQUIRED: key C_region; both coordinates MUST NOT overlap with V_region. If the region can be unambiguously identified, the respective official gene symbol MUST be reported using the /gene qualifier. If only the isotype (e.g. IgG) but not the subclass (e.g. IgG1) can be identified, a truncated gene symbol (e.g. IGHG instead of IGHG1) SHOULD be reported instead [9].
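The coordinate constraints listed above can be checked mechanically. The sketch below is one possible reading, not a reference validator: the function name `check_layout` and the `(key, start, end)` tuple layout are hypothetical, and open-interval markers are assumed to have been stripped beforehand.

```python
def check_layout(features):
    """Return a list of violated coordinate rules.

    features: list of (key, start, end) tuples, 1-based inclusive.
    """
    regions = [(s, e) for k, s, e in features if k == "V_region"]
    if len(regions) != 1:
        return ["expected exactly one V_region"]
    v_start, v_end = regions[0]

    problems = []
    for key, s, e in features:
        # V, D and J segments have to lie completely inside V_region.
        if key in ("V_segment", "D_segment", "J_segment"):
            if not (v_start <= s <= e <= v_end):
                problems.append(f"{key} {s}..{e} is not within V_region")
        # C_region must not share any position with V_region.
        if key == "C_region" and s <= v_end and e >= v_start:
            problems.append(f"C_region {s}..{e} overlaps V_region")

    # Two D segments are allowed, but they must not overlap each other.
    d_segments = sorted((s, e) for k, s, e in features if k == "D_segment")
    for (_, e1), (s2, _) in zip(d_segments, d_segments[1:]):
        if s2 <= e1:
            problems.append("D_segment features overlap")
    return problems


# Coordinates taken from the example record in the appendix.
record = [
    ("V_region", 1, 324),
    ("V_segment", 1, 257),
    ("D_segment", 266, 272),
    ("J_segment", 291, 324),
    ("C_region", 325, 420),
]
print(check_layout(record))
```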
Each [VDJ]_segment key MUST or SHOULD contain the following qualifiers:
REQUIRED: qualifier /gene, containing the designation of the inferred segment, according to the database in the first /db_xref entry. This qualifier MUST NOT contain any allele information.
RECOMMENDED: qualifier /allele, containing the designation of the inferred allele, according to the database in the first /db_xref entry. Note that while INSDC does not specify any format for this qualifier, AIRR compliance REQUIRES that this field only contains the allele string, i.e. without the gene name or separator characters.
REQUIRED: qualifier /db_xref, linking to the reference record of the inferred segment in a germline database [INSDC_XREF]. This qualifier can be present multiple times; however, only the first entry is mandatory and MUST link to the database used for the segment designation given with /gene and (if present) /allele.
  Note on referencing IMGT databases: There are two IMGT databases available in the controlled vocabulary [INSDC_XREF]:
    IMGT/GENE-DB: This is the genome database, which requires that a reference sequence has been mapped to genomic DNA. When using this database as reference, note that you can only refer to the gene symbol, not the allele. In the case of ambiguous allele calls (see below) this means that you MUST NOT annotate any /allele at all. Nevertheless, this SHOULD be the default database for applications using IMGT as reference, as the sequence for each gene/allele is unique.
    IMGT/LIGM: This database collects sequences described in INSDC databases (GenBank/ENA/DDBJ). As it might contain multiple entries representing a given gene/allele, it is NOT RECOMMENDED to use it unless the inferred gene/allele is only present in IMGT/LIGM and not in IMGT/GENE-DB.
RECOMMENDED: qualifier /inference to indicate the tool used for segment inference. The description string SHOULD use COORDINATES as category and alignment as type [INSDC_FT].
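Many inference tools report a combined call such as `IGHV1-34*01`. A minimal sketch of splitting such a call into the separate /gene and /allele values is shown below; it assumes IMGT-style names in which `*` separates gene and allele, and the function name `split_call` is hypothetical.

```python
def split_call(call):
    """Split an IMGT-style segment call into (gene, allele).

    The allele is None when the call carries no allele part.
    """
    gene, separator, allele = call.partition("*")
    if not gene or (separator and not allele):
        raise ValueError(f"malformed segment call: {call!r}")
    return gene, (allele if separator else None)


gene, allele = split_call("IGHV1-34*01")
print(f'/gene="{gene}"')
print(f'/allele="{allele}"')
```

Keeping the separator out of both values satisfies the requirement that /gene carries no allele information and /allele carries only the allele string.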
Annotation of sequences producing multiple hits with identical scores is problematic and is ultimately at the discretion of the depositing researcher. However, the algorithms used for tie-breaking SHOULD be documented in data set 5. In addition, the following procedures MUST be followed:
Certain gene, ambiguous allele: If multiple alleles of the same gene match the sequence, the /allele qualifier MUST NOT be used. As the REQUIRED /db_xref qualifier will often refer to a specific allele, all equal hits SHOULD be annotated via this qualifier (which can be used multiple times). Also see the note on the limitations of the IMGT/GENE-DB reference database above.
Ambiguous gene: Pick one, annotate using the qualifiers as noted for ambiguous allele.
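One possible reading of these tie-handling rules is sketched below. The function name `segment_qualifiers`, the `(gene, allele, db_xref)` tuple layout and the second accession are hypothetical. Picking the first hit for an ambiguous gene is an arbitrary tie-break chosen for illustration; whatever rule is actually used SHOULD be documented in data set 5.

```python
def segment_qualifiers(hits):
    """Build the qualifier lines for one [VDJ]_segment key.

    hits: list of (gene, allele, db_xref) tuples that share the top score.
    The first hit determines /gene and the first /db_xref entry.
    """
    if not hits:
        raise ValueError("at least one hit is required")
    gene = hits[0][0]
    qualifiers = [f'/gene="{gene}"']
    # /allele is only allowed when the call is unambiguous.
    if len(hits) == 1:
        qualifiers.append(f'/allele="{hits[0][1]}"')
    # Every equally scoring hit is recorded through its own /db_xref.
    qualifiers.extend(f'/db_xref="{xref}"' for _, _, xref in hits)
    return qualifiers


ties = [
    ("IGHV1-34", "01", "IMGT/LIGM:AC073565"),
    ("IGHV1-34", "02", "IMGT/LIGM:XX000000"),  # hypothetical accession
]
for line in segment_qualifiers(ties):
    print(line)
```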
JUNCTION annotation#
INSDC currently does not define a key to annotate JUNCTION [10]. Therefore the following procedure MUST be used:
REQUIRED: key CDS, indicating the following positions as determined by the utilized V(D)J inference tool:
  the first bp of the first AA of JUNCTION
  the last bp of the last AA of JUNCTION
  Open coordinates MUST be used for both coordinates to allow for automated creation of the /translation qualifier providing the peptide sequence. Further note that a non-productive JUNCTION can have a length not divisible by three. This key contains the following qualifiers:
  REQUIRED: qualifier /codon_start with the assigned value “1”.
  REQUIRED: qualifier /function with the assigned value “JUNCTION”.
  REQUIRED: qualifier /product with an assigned value matching the regular expression
    /^(immunoglobulin (heavy|light)|T cell receptor (alpha|beta|gamma|delta)) chain junction region$/
    The variable region referred to in the string MUST be the same as the one given in the misc_feature key.
  RECOMMENDED: qualifier /inference, indicating the tool used for positional inference. The description string SHOULD use COORDINATES as category and protein motif as type [INSDC_FT].
  FORBIDDEN: qualifier /translation, which will be automatically added by GenBank.
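The two regular expressions and the requirement that both strings name the same chain can be checked together. The sketch below reuses the patterns from this specification; the helper name `chains_consistent` is hypothetical.

```python
import re

CHAIN = r"(immunoglobulin (heavy|light)|T cell receptor (alpha|beta|gamma|delta)) chain"
NOTE_RE = re.compile(rf"^{CHAIN} variable region$")
PRODUCT_RE = re.compile(rf"^{CHAIN} junction region$")


def chains_consistent(note, product):
    """True if both strings are valid and refer to the same chain type."""
    note_match = NOTE_RE.fullmatch(note)
    product_match = PRODUCT_RE.fullmatch(product)
    if not (note_match and product_match):
        return False
    # Group 1 holds the chain description, e.g. "immunoglobulin heavy".
    return note_match.group(1) == product_match.group(1)


print(chains_consistent(
    "immunoglobulin heavy chain variable region",
    "immunoglobulin heavy chain junction region",
))
```

`fullmatch` is used so that a trailing newline is rejected, which `$` alone would tolerate in Python.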
Note that the complete CDS key will be removed by GenBank if the translation contains stop codons or too many “N” (exact number unknown). As such a record will lack a central piece of REQUIRED information, it is RECOMMENDED that submitters do one of the following upfront, as described in the submission manual:
remove the complete record, or
replace the CDS with a misc_feature key while at the same time removing the /codon_start and /product qualifiers.
If the submitter chooses the replacement option, it has to be ensured that the annotated coordinates are actually valid and not affected by the frameshift.
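Whether a JUNCTION is at risk of losing its CDS key can be estimated before submission. The sketch below translates the junction with the standard genetic code and only tests for stop codons; the threshold for “too many N” is unknown and therefore not modelled. The function names are hypothetical, and the test sequence is the stretch of the appendix example record that encodes the peptide CARAGVYDGYTMDYW.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate complete codons; ambiguous codons become 'X'."""
    seq = nucleotides.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def junction_key(nucleotides):
    """Suggest 'CDS', or 'misc_feature' if the translation contains a stop."""
    return "misc_feature" if "*" in translate(nucleotides) else "CDS"


junction = "tgtgcaagagcgggagtttacgacggatatactatggactactgg"
print(translate(junction), junction_key(junction))
```

A junction whose length is not divisible by three is translated up to its last complete codon, so an out-of-frame rearrangement still needs separate inspection.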
Record body#
The record body starts after ORIGIN (ENA: SQ line) and MUST contain:
the consensus sequence
References#
[LIGMDB_V12] IMGT-ONTOLOGY definitions. <http://www.imgt.org/ligmdb/label#JUNCTION>
[INSDC_FT] The DDBJ/ENA/GenBank Feature Table Definition. <http://www.insdc.org/documents/feature-table>
[ENA_MANUAL] European Nucleotide Archive Annotated/Assembled Sequences User Manual. <http://ftp.ebi.ac.uk/pub/databases/ena/sequence/release/doc/usrman.txt>
[GENBANK_SR] GenBank Sample Record. <https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html>
[INSDC_XREF] Controlled vocabulary for /db_xref qualifier. <http://www.insdc.org/documents/dbxref-qualifier-vocabulary>
[NCBI_NBK47528] SRA Handbook. <https://www.ncbi.nlm.nih.gov/books/NBK47528/>
Footnotes#
Appendix#
Example record (GenBank format)#
LOCUS       AB123456                 420 bp    mRNA    linear   EST 01-JAN-2015
DEFINITION  TLS: Mus musculus immunoglobulin heavy chain variable region,
            sequence.
ACCESSION   AB123456
VERSION     AB123456.7
KEYWORDS    TLS; Targeted Locus Study; AIRR; MiAIRR:1.0.
SOURCE      Mus musculus
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires;
            Rodentia; Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 420)
  AUTHORS   Stibbons,P.
  TITLE     Section 5 information for experiment FOO1
  JOURNAL   published (01-JAN-2000) on Zenodo
  REMARK    DOI:10.1000/0000-12345678
REFERENCE   2  (bases 1 to 420)
  AUTHORS   Stibbons,P.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-JAN-2000) Center for Transcendental Immunology,
            Unseen University, Ankh-Morpork, 12345, DISCWORLD
DBLINK      BioProject: PRJNA000001
            BioSample: SAMN000001
            Sequence Read Archive: SRR0000001
FEATURES             Location/Qualifiers
     source          1..420
                     /organism="Mus musculus"
                     /mol_type="mRNA"
                     /strain="C57BL/6J"
                     /citation=[1]
                     /rearranged
                     /note="AIRR_READ_COUNT:123"
     V_region        1..324
     misc_feature    1..324
                     /note="immunoglobulin heavy chain variable region"
     V_segment       1..257
                     /gene="IGHV1-34"
                     /allele="01"
                     /db_xref="IMGT/LIGM:AC073565"
                     /inference="COORDINATES:alignment:IgBLAST:1.6"
     D_segment       266..272
                     /gene="IGHD2-2"
                     /allele="01"
                     /db_xref="IMGT/LIGM:AJ851868"
                     /inference="COORDINATES:alignment:IgBLAST:1.6"
     J_segment       291..324
                     /gene="IGHJ4"
                     /allele="01"
                     /db_xref="IMGT/LIGM:V00770"
                     /inference="COORDINATES:alignment:IgBLAST:1.6"
     CDS             <258..>290
                     /codon_start=1
                     /function="JUNCTION"
                     /product="immunoglobulin heavy chain junction region"
                     /inference="COORDINATES:protein motif:IgBLAST:1.6"
                     /translation="CARAGVYDGYTMDYW"
     C_region        325..420
                     /gene="Ighg2c"
ORIGIN
        1 agcctggggc ttcagtgaag atgtcctgca aggcttctgg ctacacattc actgactata
       61 acatacactg ggtgaagcag agccatggaa agagccttga gtggattgca tatattaatc
      121 ctaacaatgg tggttatggc tataacgaca agttcaggga caaggccaca ttgactgtcg
      181 acaggtcatc caacacagcc tacatggggc tccgcagcct gacctctgag gactctgcag
      241 tctattactg tgcaagagcg ggagtttacg acggatatac tatggactac tggggtcaag
      301 gaacctcagt caccgtctcc tcagccaaaa caacagcccc atcggtctat ccactggccc
      361 ctgtgtgtgg aggtacaact ggctcctcgg tgactctagg atgcctggtc aagggcaact
//
Glossary#
MUST / REQUIRED: Indicates that an element or action is necessary to conform to the standard.
SHOULD / RECOMMENDED: Indicates that an element or action is considered to be best practice by AIRR, but not necessary to conform to the standard.
CAN / OPTIONAL: Indicates that it is at the discretion of the user to use an element or perform an action.
MUST NOT / FORBIDDEN: Indicates that an element or action will be in conflict with the standard.
Abbreviations#
AA: amino acid
bp: base pair
DOI: digital object identifier
FT: INSDC Feature Table
INSDC: International Nucleotide Sequence Database Collaboration
SRA: sequence read archive