MiAIRR Data Elements
The AIRR Community has agreed to six high-level data sets that will guide the publication, curation and sharing of AIRR-Seq data and metadata: study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.
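The numeric prefix in the Set / Subset column of the table refers to these six sets. As a minimal sketch, assuming the labels always follow the `N / subset` pattern seen in the table, a label can be split into its set and subset like this. The names `MiairrSet` and `parse_set_label` are hypothetical and are not part of any AIRR library.

```python
from enum import IntEnum
from typing import Tuple


class MiairrSet(IntEnum):
    """The six high-level MiAIRR data sets (member names are illustrative)."""
    STUDY_AND_SUBJECT = 1
    SAMPLE_COLLECTION = 2
    SAMPLE_PROCESSING_AND_SEQUENCING = 3
    RAW_SEQUENCES = 4
    PROCESSING_OF_SEQUENCE_DATA = 5
    PROCESSED_AIRR_SEQUENCES = 6


def parse_set_label(label: str) -> Tuple[MiairrSet, str]:
    """Split a 'Set / Subset' cell such as '3 / process (cell)' into its parts."""
    # Split on the first slash only; the subset itself contains no slash.
    number, separator, subset = label.partition("/")
    if not separator:
        raise ValueError(f"expected 'N / subset', got {label!r}")
    # MiairrSet(...) raises ValueError for numbers outside 1 to 6.
    return MiairrSet(int(number.strip())), subset.strip()


data_set, subset = parse_set_label("3 / process (nucleic acid [pcr])")
print(data_set.name, "|", subset)
```

Grouping rows by the returned set number is one way to check that a metadata record covers every set it is expected to.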
Set / Subset | Designation / Field | Type / Format | Level | Definition | Example |
---|---|---|---|---|---|
1 / study | Study ID | string | important | Unique ID assigned by study registry such as one of the International Nucleotide Sequence Database Collaboration (INSDC) repositories. | PRJNA001 |
1 / study | Study title | string | important | Descriptive study title | Effects of sunlight exposure on the Treg repertoire |
1 / study | Study type | Ontology | important | Type of study design | id: NCIT:C15197, label: Case-Control Study |
1 / study | Study inclusion/exclusion criteria | string | important | List of criteria for inclusion/exclusion for the study | Include: Clinical P. falciparum infection; Exclude: Seropositive for HIV |
1 / study | Grant funding agency | string | important | Funding agencies and grant numbers | NIH, award number R01GM987654 |
1 / study | Contributors | array of Contributor | essential | List of individuals who contributed to the study. Note that these are not necessarily identical with the authors on an associated manuscript or other scholarly communication. Further note that typically at least the three CRediT contributor roles “supervision”, “investigation” and “data curation” should be assigned. The corresponding author should be listed last. | |
1 / study | Relevant publications | array of string | important | Array of publications describing the rationale and/or outcome of the study as an array of CURIE objects such as a DOI or PubMed ID. Where more than one publication is given, if there is a primary publication for the study it should come first. | ['PMID:29144493', 'DOI:10.1038/ni.3873'] |
1 / study | Keywords for study | array of string | important | Keywords describing properties of one or more data sets in a study. “contains_schema” keywords indicate that the study contains data objects from the AIRR Schema of that type (Rearrangement, Clone, Cell, Receptor) while the other keywords indicate that the study design considers the type of data indicated (e.g. it is possible to have a study that “contains_paired_chain” but does not “contains_schema_cell”). | ['contains_ig', 'contains_schema_rearrangement', 'contains_schema_clone', 'contains_schema_cell'] |
1 / subject | Subject ID | string | important | Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used. | SUB856413 |
1 / subject | Synthetic library | boolean | essential | TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display) | |
1 / subject | Organism | Ontology | essential | Binomial designation of subject’s species | id: NCBITAXON:9606, label: Homo sapiens |
1 / subject | Sex | string | important | Biological sex of subject | female |
1 / subject | Age | TimeInterval | important | Age of subject expressed as a time interval. For a single time point, min == max in the time interval. | |
1 / subject | Age event | string | important | Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time, hence the chosen events should be in temporal proximity. | enrollment |
1 / subject | Ancestry population | Ontology | important | Broad geographic origin of ancestry (continent) | id: GAZ:00000459, label: South America |
1 / subject | | Ontology | important | Self-reported location of birth of the subject, preferred granularity is country-level | id: GAZ:00002939, label: Poland |
1 / subject | Ethnicity | string | important | Ethnic group of subject (defined as cultural/language-based membership) | English, Kurds, Manchu, Yakuts (and other fields from Wikipedia) |
1 / subject | Race | string | important | Racial group of subject (as defined by NIH) | White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other |
1 / subject | Strain name | string | important | Non-human designation of the strain or breed of animal used | C57BL/6J |
1 / subject | Relation to other subjects | string | important | Subject ID to which Relation type refers | SUB1355648 |
1 / subject | Relation type | string | important | Relation between subject and linked_subjects; can be genetic or environmental (e.g. exposure) | father, daughter, household |
1 / diagnosis and intervention | Study group description | string | important | Designation of study arm to which the subject is assigned | control |
1 / diagnosis and intervention | Diagnosis timepoint | TimePoint | important | Time point for the diagnosis | label: Study enrollment, value: 60, unit: {id: UO:0000033, label: day} |
1 / diagnosis and intervention | Diagnosis | Ontology | important | Diagnosis of subject | id: DOID:9538, label: multiple myeloma |
1 / diagnosis and intervention | Length of disease | TimeQuantity | important | Time duration between initial diagnosis and current intervention | quantity: 23, unit: {id: UO:0000035, label: month} |
1 / diagnosis and intervention | Disease stage | string | important | Stage of disease at current intervention | Stage II |
1 / diagnosis and intervention | Prior therapies for primary disease under study | string | important | List of all relevant previous therapies applied to subject for treatment of Diagnosis | melphalan/prednisone |
1 / diagnosis and intervention | Immunogen/agent | string | important | Antigen, vaccine or drug applied to subject at this intervention | bortezomib |
1 / diagnosis and intervention | Intervention definition | string | important | Description of intervention | systemic chemotherapy, 6 cycles, 1.25 mg/m2 |
1 / diagnosis and intervention | Other relevant medical history | string | important | Medical history of subject that is relevant to assess the course of disease and/or treatment | MGUS, first diagnosed 5 years prior |
2 / sample | Biological sample ID | string | important | Sample ID assigned by submitter, unique within study. If possible, a persistent sample ID linked to INSDC or similar repository study should be used. | SUP52415 |
2 / sample | Sample type | string | important | The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture | Biopsy |
2 / sample | Tissue | Ontology | important | The actual tissue sampled, e.g. lymph node, liver, peripheral blood | id: UBERON:0002371, label: bone marrow |
2 / sample | Anatomic site | string | important | The anatomic location of the tissue, e.g. inguinal, femur | Iliac crest |
2 / sample | Disease state of sample | string | important | Histopathologic evaluation of the sample | Tumor infiltration |
2 / sample | Sample collection time | TimePoint | important | Time point at which sample was taken, relative to label event | label: Primary vaccination, value: 14, unit: {id: UO:0000033, label: day} |
2 / sample | Sample collection location | Ontology | important | Location where the sample was taken, preferred granularity is country-level | id: GAZ:00002939, label: Poland |
2 / sample | Biomaterial provider | string | important | Name and address of the entity providing the sample | Tissues-R-Us, Tampa, FL, USA |
3 / process (cell) | Tissue processing | string | important | Enzymatic digestion and/or physical methods used to isolate cells from sample | Collagenase A/DNase I digested, followed by Percoll gradient |
3 / process (cell) | Cell subset | Ontology | important | Commonly-used designation of isolated cell population | id: CL:0000972, label: class switched memory B cell |
3 / process (cell) | Cell subset phenotype | string | important | List of cellular markers and their expression levels used to isolate the cell population. | CD19+ CD38+ CD27+ IgM- IgD- |
3 / process (cell) | Cell annotation | string | defined | Free text cell type annotation. Primarily used for annotating cell types that are not provided in the Cell Ontology. | age-associated B cell |
3 / process (cell) | Cell species | Ontology | defined | Binomial designation of the species from which the analyzed cells originate. Typically, this value should be identical to species, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g., chimeric animal models. If set, this key will overwrite the species information for all lower layers of the schema. | id: NCBITAXON:9606, label: Homo sapiens |
3 / process (cell) | Single-cell sort | boolean | important | TRUE if single cells were isolated into separate compartments | |
3 / process (cell) | Number of cells in experiment | integer | important | Total number of cells that went into the experiment | 1000000 |
3 / process (cell) | Number of cells per sequencing reaction | integer | important | Number of cells for each biological replicate | 50000 |
3 / process (cell) | Cell storage | boolean | important | TRUE if cells were cryo-preserved between isolation and further processing | True |
3 / process (cell) | Cell quality | string | important | Relative amount of viable cells after preparation and (if applicable) thawing | 90% viability as determined by 7-AAD |
3 / process (cell) | Cell isolation / enrichment procedure | string | important | Description of the procedure used for marker-based isolation or enrichment of cells | Cells were stained with fluorochrome labeled antibodies and then sorted on a FlowMerlin (CE) cytometer. |
3 / process (cell) | Processing protocol | string | important | Description of the methods applied to the sample including cell preparation/isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript. | Stimulated with anti-CD3/anti-CD28 |
3 / process (nucleic acid) | Target substrate | string | essential | The class of nucleic acid that was used as primary starting material for the following procedures | RNA |
3 / process (nucleic acid) | Target substrate quality | string | important | Description and results of the quality control performed on the template material | RIN 9.2 |
3 / process (nucleic acid) | Template amount | PhysicalQuantity | important | Amount of template that went into the process | quantity: 1000, unit: {id: UO:0000024, label: nanogram} |
3 / process (nucleic acid) | Library generation method | string | essential | Generic type of library generation | RT(oligo-dT)+TS(UMI)+PCR |
3 / process (nucleic acid) | Library generation protocol | string | important | Description of processes applied to substrate to obtain a library that is ready for sequencing | cDNA was generated using |
3 / process (nucleic acid) | Protocol IDs | string | important | When using a library generation protocol from a commercial provider, provide the protocol version number | v2.1 (2016-09-15) |
3 / process (nucleic acid) | Complete sequences | string | essential | To be considered “complete”, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered “complete & untemplated”, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. “mixed” should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value. | partial |
3 / process (nucleic acid) | Physical linkage of different rearrangements | string | essential | In case an experimental setup is used that physically links nucleic acids derived from distinct Rearrangements before library preparation, this field describes the mode of that linkage. All hetero_* terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. *_head-head refers to techniques that link the 5’ ends of transcripts in a single-cell context. *_tail-head refers to techniques that link the 3’ end of one transcript to the 5’ end of another one in a single-cell context. This term does not provide any information whether a continuous reading-frame between the two is generated. *_prelinked refers to constructs in which the linkage was already present on the DNA level (e.g. scFv). | hetero_head-head |
3 / process (nucleic acid [pcr]) | Target locus for PCR | string | important | Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature. | IGK |
3 / process (nucleic acid [pcr]) | Forward PCR primer target location | string | important | Position of the most distal nucleotide templated by the forward primer or primer mix | IGHV, +23 |
3 / process (nucleic acid [pcr]) | Reverse PCR primer target location | string | important | Position of the most proximal nucleotide templated by the reverse primer or primer mix | IGHG, +57 |
3 / process (sequencing) | Batch number | string | important | ID of sequencing run assigned by the sequencing facility | 160101_M01234 |
3 / process (sequencing) | Total reads passing QC filter | integer | important | Number of usable reads for analysis | 10365118 |
3 / process (sequencing) | Sequencing platform | string | important | Designation of sequencing instrument used | Alumina LoSeq 1000 |
3 / process (sequencing) | Sequencing facility | string | important | Name and address of sequencing facility | Seqs-R-Us, Vancouver, BC, Canada |
3 / process (sequencing) | Date of sequencing run | string | important | Date of sequencing run | 2016-12-16 |
3 / process (sequencing) | Sequencing kit | string | important | Name, manufacturer, order and lot numbers of sequencing kit | FullSeq 600, Alumina, #M123456C0, 789G1HK |
4 / data (raw reads) | Raw sequencing data persistent identifier | string | important | Persistent identifier of raw data stored in an archive (e.g. INSDC run ID). Data archive should be identified in the CURIE prefix. | SRA:SRR11610494 |
4 / data (raw reads) | Raw sequencing data file type | string | important | File format for the raw reads or sequences | |
4 / data (raw reads) | Raw sequencing data file name | string | important | File name for the raw reads or sequences. The first file in paired-read sequencing. | MS10R-NMonson-C7JR9_S1_R1_001.fastq |
4 / data (raw reads) | Read direction | string | important | Read direction for the raw reads or sequences. The first file in paired-read sequencing. | forward |
4 / data (raw reads) | Forward read length | integer | important | Read length in bases for the first file in paired-read sequencing | 300 |
4 / data (raw reads) | Paired raw sequencing data file name | string | important | File name for the second file in paired-read sequencing | MS10R-NMonson-C7JR9_S1_R2_001.fastq |
4 / data (raw reads) | Paired read direction | string | important | Read direction for the second file in paired-read sequencing | reverse |
4 / data (raw reads) | Paired read length | integer | important | Read length in bases for the second file in paired-read sequencing | 300 |
5 / process (computational) | Software tools and version numbers | string | important | Version number and/or date; include company pipelines | IgBLAST 1.6 |
5 / process (computational) | Paired read assembly | string | important | How paired end reads were assembled into a single receptor sequence | PandaSeq (minimal overlap 50, threshold 0.8) |
5 / process (computational) | Quality thresholds | string | important | How/if sequences were removed from (4) based on base quality scores | Average Phred score >=20 |
5 / process (computational) | Primer match cutoffs | string | important | How primers were identified in the sequences and whether they were removed, masked, etc. | Hamming distance <= 2 |
5 / process (computational) | Collapsing method | string | important | The method used for combining multiple sequences from (4) into a single sequence in (5) | MUSCLE 3.8.31 |
5 / process (computational) | Data processing protocols | string | important | General description of how QC is performed | Data was processed using […] |
5 / data (processed sequence) | V(D)J germline reference database | string | important | Source of germline V(D)J genes with version number or date accessed. | ENSEMBL, Homo sapiens build 90, 2017-10-01 |
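
Several rows in the table use structured values rather than plain strings: Ontology values carry an `id` and a `label`, and TimePoint values carry a `label`, a `value` and a `unit`. The following is a minimal sketch of how such values might be sanity-checked before submission. It assumes the values are held as plain Python dicts with the key names shown in the Example column. The helper names and the loose CURIE pattern are illustrative assumptions, not part of the AIRR schema or of any AIRR library.

```python
import re

# Loose CURIE shape: "<prefix>:<reference>", e.g. "UO:0000033" or "SRA:SRR11610494".
# This pattern is an illustrative assumption, not the official CURIE grammar.
CURIE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.-]*:\S+$")


def is_curie(value):
    """Return True if value is a string shaped like prefix:reference."""
    return isinstance(value, str) and CURIE_PATTERN.match(value) is not None


def check_ontology(term):
    """List problems with an Ontology-style value (a dict with id and label)."""
    if not isinstance(term, dict):
        return ["not a mapping"]
    problems = []
    if not is_curie(term.get("id")):
        problems.append("id is not a CURIE")
    if not term.get("label"):
        problems.append("label is missing")
    return problems


def check_time_point(time_point):
    """List problems with a TimePoint-style value (label, value, unit)."""
    if not isinstance(time_point, dict):
        return ["not a mapping"]
    problems = []
    if not time_point.get("label"):
        problems.append("label is missing")
    value = time_point.get("value")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        problems.append("value is not a number")
    # The unit is itself an Ontology-style value.
    problems.extend(f"unit: {p}" for p in check_ontology(time_point.get("unit")))
    return problems


# Example taken from the "Sample collection time" row.
sample_collection_time = {
    "label": "Primary vaccination",
    "value": 14,
    "unit": {"id": "UO:0000033", "label": "day"},
}
print(check_time_point(sample_collection_time))                   # []
print(check_ontology({"id": "9606", "label": "Homo sapiens"}))    # ['id is not a CURIE']
```

Returning a list of problems instead of raising on the first one lets a submitter see every issue in a record at once. The same `is_curie` check applies to string fields such as Relevant publications and Raw sequencing data persistent identifier, whose definitions call for a CURIE prefix.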