MiAIRR Data Elements

The AIRR Community has agreed on six high-level data sets that guide the publication, curation and sharing of AIRR-Seq data and metadata: study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.

Download as TSV.

Set / Subset

Designation / Field

Type / Format

Level

Definition

Example

1 / study

Study ID
study_id

string
free text

important

Unique ID assigned by study registry such as one of the International Nucleotide Sequence Database Collaboration (INSDC) repositories.

PRJNA001

1 / study

Study title
study_title

string
free text

important

Descriptive study title

Effects of sunlight exposure on the Treg repertoire

1 / study

Study type
study_type

Ontology
Ontology: { top_node: { id: NCIT:C63536, label: Study}}

important

Type of study design

id: NCIT:C15197, label: Case-Control Study

1 / study

Study inclusion/exclusion criteria
inclusion_exclusion_criteria

string
free text

important

List of criteria for inclusion/exclusion for the study

Include: Clinical P. falciparum infection; Exclude: Seropositive for HIV

1 / study

Grant funding agency
grants

string
free text

important

Funding agencies and grant numbers

NIH, award number R01GM987654

1 / study

Contributors
contributors

array of Contributor
**

essential

List of individuals who contributed to the study. Note that these are not necessarily identical with the authors on an associated manuscript or other scholarly communication. Further note that typically at least the three CRediT contributor roles “supervision”, “investigation” and “data curation” should be assigned. The corresponding author should be listed last.

1 / study

Relevant publications
pub_ids

array of string
**

important

Publications describing the rationale and/or outcome of the study, given as an array of CURIE objects such as a DOI or PubMed ID. Where more than one publication is given and the study has a primary publication, it should come first.

['PMID:29144493', 'DOI:10.1038/ni.3873']
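
The CURIE convention described for pub_ids can be illustrated with a minimal Python sketch. The helper name split_curie is hypothetical and not part of any AIRR library; it only assumes that a CURIE has the form prefix:local_id, as in the example values above.

```python
from typing import Tuple


def split_curie(curie: str) -> Tuple[str, str]:
    """Split a CURIE such as 'PMID:29144493' into (prefix, local_id).

    Hypothetical helper for illustration; only the first ':' separates
    the prefix, so local IDs containing ':' are kept intact.
    """
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, local_id


pub_ids = ["PMID:29144493", "DOI:10.1038/ni.3873"]

# If the study has a primary publication, it is expected to come first.
primary_prefix, primary_id = split_curie(pub_ids[0])
print(primary_prefix, primary_id)
```

The sketch does not check that the prefix names a known registry; that would require a list of accepted prefixes, which this table does not define.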

1 / study

Keywords for study
keywords_study

array of string
**

important

Keywords describing properties of one or more data sets in a study. “contains_schema” keywords indicate that the study contains data objects from the AIRR Schema of that type (Rearrangement, Clone, Cell, Receptor) while the other keywords indicate that the study design considers the type of data indicated (e.g. it is possible to have a study that “contains_paired_chain” but does not “contains_schema_cell”).

['contains_ig', 'contains_schema_rearrangement', 'contains_schema_clone', 'contains_schema_cell']
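
The distinction between “contains_schema” keywords and the remaining study-design keywords can be sketched as follows. The function name split_keywords and the prefix constant are illustrative assumptions; only the keyword values themselves come from the example above.

```python
# Assumed prefix shared by keywords that announce AIRR Schema objects.
SCHEMA_PREFIX = "contains_schema_"


def split_keywords(keywords):
    """Separate schema-object keywords from study-design keywords.

    Returns (schema_objects, design_keywords), where schema_objects holds
    the object type with the prefix stripped, e.g. 'rearrangement'.
    """
    schema_objects = [
        k[len(SCHEMA_PREFIX):] for k in keywords if k.startswith(SCHEMA_PREFIX)
    ]
    design_keywords = [k for k in keywords if not k.startswith(SCHEMA_PREFIX)]
    return schema_objects, design_keywords


keywords_study = [
    "contains_ig",
    "contains_schema_rearrangement",
    "contains_schema_clone",
    "contains_schema_cell",
]
schema_objects, design_keywords = split_keywords(keywords_study)
print(schema_objects)
print(design_keywords)
```

A study tagged only with “contains_paired_chain” would yield an empty schema_objects list, which matches the case mentioned in the definition: the design considers paired chains, but no Cell objects are provided.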

1 / subject

Subject ID
subject_id

string
free text

important

Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used.

SUB856413

1 / subject

Synthetic library
synthetic

boolean
true | false

essential

TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display)

1 / subject

Organism
species

Ontology
Ontology: { top_node: { id: NCBITAXON:7776, label: Gnathostomata}}

essential

Binomial designation of subject’s species

id: NCBITAXON:9606, label: Homo sapiens

1 / subject

Sex
sex

string
free text

important

Biological sex of subject

female

1 / subject


age

TimeInterval
**

important

Age of subject expressed as a time interval. For a single time point, min == max in the time interval.
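
A minimal sketch of the min == max convention, assuming that a TimeInterval is a mapping with the keys min, max and unit (the table names the type but does not list its keys here). The helper names and the default unit UO:0000036 (year) are assumptions made for illustration.

```python
def make_age(min_age, max_age=None, unit_id="UO:0000036", unit_label="year"):
    """Build an age interval; a single time point is stored as min == max.

    The key names 'min', 'max' and 'unit' are assumed, not taken from
    this table.
    """
    if max_age is None:
        max_age = min_age  # single time point
    if min_age > max_age:
        raise ValueError("min must not exceed max")
    return {
        "min": min_age,
        "max": max_age,
        "unit": {"id": unit_id, "label": unit_label},
    }


def is_single_time_point(age):
    """True if the interval collapses to one time point."""
    return age["min"] == age["max"]


exact_age = make_age(42)
age_range = make_age(40, 49)
print(is_single_time_point(exact_age), is_single_time_point(age_range))
```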

1 / subject

Age event
age_event

string
free text

important

Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations, submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time, hence the chosen events should be in temporal proximity.

enrollment

1 / subject

Ancestry population
ancestry_population

Ontology
Ontology: { top_node: { id: GAZ:00000448, label: geographic location}}

important

Broad geographic origin of ancestry (continent)

id: GAZ:00000459, label: South America

1 / subject


location_birth

Ontology
Ontology: { top_node: { id: GAZ:00000448, label: geographic location}}

important

Self-reported location of birth of the subject, preferred granularity is country-level

id: GAZ:00002939, label: Poland

1 / subject

Ethnicity
ethnicity

string
free text

important

Ethnic group of subject (defined as cultural/language-based membership)

English, Kurds, Manchu, Yakuts (and other fields from Wikipedia)

1 / subject

Race
race

string
free text

important

Racial group of subject (as defined by NIH)

White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other

1 / subject

Strain name
strain_name

string
free text

important

Non-human designation of the strain or breed of animal used

C57BL/6J

1 / subject

Relation to other subjects
linked_subjects

string
free text

important

Subject ID to which Relation type refers

SUB1355648

1 / subject

Relation type
link_type

string
free text

important

Relation between subject and linked_subjects, can be genetic or environmental (e.g. exposure)

father, daughter, household

1 / diagnosis and intervention

Study group description
study_group_description

string
free text

important

Designation of the study arm to which the subject is assigned

control

1 / diagnosis and intervention

Diagnosis timepoint
diagnosis_timepoint

TimePoint
**

important

Time point for the diagnosis

label: Study enrollment, value: 60, unit: { id: UO:0000033, label: day }
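
The TimePoint example above carries three pieces of information: a label naming the reference event, a value, and a unit. A minimal Python sketch, using a plain dict with those three keys, shows how such a value might be checked and rendered. The function name describe_time_point is hypothetical.

```python
# Keys taken from the example value above.
REQUIRED_KEYS = ("label", "value", "unit")


def describe_time_point(time_point):
    """Render a TimePoint-like dict as a short human-readable string."""
    missing = [key for key in REQUIRED_KEYS if key not in time_point]
    if missing:
        raise KeyError(f"time point is missing keys: {missing}")
    unit_label = time_point["unit"]["label"]
    return (
        f"{time_point['value']} {unit_label}(s) "
        f"relative to '{time_point['label']}'"
    )


diagnosis_timepoint = {
    "label": "Study enrollment",
    "value": 60,
    "unit": {"id": "UO:0000033", "label": "day"},
}
print(describe_time_point(diagnosis_timepoint))
```

The same shape is used by the Sample collection time field, so the sketch applies there unchanged.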

1 / diagnosis and intervention

Diagnosis
disease_diagnosis

Ontology
Ontology: { top_node: { id: DOID:4, label: disease}}

important

Diagnosis of subject

id: DOID:9538, label: multiple myeloma

1 / diagnosis and intervention

Length of disease
disease_length

TimeQuantity
**

important

Time duration between initial diagnosis and current intervention

quantity: 23, unit: { id: UO:0000035, label: month }

1 / diagnosis and intervention

Disease stage
disease_stage

string
free text

important

Stage of disease at current intervention

Stage II

1 / diagnosis and intervention

Prior therapies for primary disease under study
prior_therapies

string
free text

important

List of all relevant previous therapies applied to subject for treatment of Diagnosis

melphalan/prednisone

1 / diagnosis and intervention

Immunogen/agent
immunogen

string
free text

important

Antigen, vaccine or drug applied to subject at this intervention

bortezomib

1 / diagnosis and intervention

Intervention definition
intervention

string
free text

important

Description of intervention

systemic chemotherapy, 6 cycles, 1.25 mg/m2

1 / diagnosis and intervention

Other relevant medical history
medical_history

string
free text

important

Medical history of subject that is relevant to assess the course of disease and/or treatment

MGUS, first diagnosed 5 years prior

2 / sample

Biological sample ID
sample_id

string
free text

important

Sample ID assigned by submitter, unique within study. If possible, a persistent sample ID linked to INSDC or similar repository study should be used.

SUP52415

2 / sample

Sample type
sample_type

string
free text

important

The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture

Biopsy

2 / sample

Tissue
tissue

Ontology
Ontology: { top_node: { id: UBERON:0010000, label: multicellular anatomical structure}}

important

The actual tissue sampled, e.g. lymph node, liver, peripheral blood

id: UBERON:0002371, label: bone marrow

2 / sample

Anatomic site
anatomic_site

string
free text

important

The anatomic location of the tissue, e.g. Inguinal, femur

Iliac crest

2 / sample

Disease state of sample
disease_state_sample

string
free text

important

Histopathologic evaluation of the sample

Tumor infiltration

2 / sample

Sample collection time
collection_time_point_relative

TimePoint
**

important

Time point at which sample was taken, relative to label event

label: Primary vaccination, value: 14, unit: { id: UO:0000033, label: day }

2 / sample

Sample collection location
collection_location

Ontology
Ontology: { top_node: { id: GAZ:00000448, label: geographic location}}

important

Location where the sample was taken, preferred granularity is country-level

id: GAZ:00002939, label: Poland

2 / sample

Biomaterial provider
biomaterial_provider

string
free text

important

Name and address of the entity providing the sample

Tissues-R-Us, Tampa, FL, USA

3 / process (cell)

Tissue processing
tissue_processing

string
free text

important

Enzymatic digestion and/or physical methods used to isolate cells from sample

Collagenase A/DNase I digested, followed by Percoll gradient

3 / process (cell)

Cell subset
cell_subset

Ontology
Ontology: { top_node: { id: CL:0000542, label: lymphocyte}}

important

Commonly-used designation of isolated cell population

id: CL:0000972, label: class switched memory B cell

3 / process (cell)

Cell subset phenotype
cell_phenotype

string
free text

important

List of cellular markers and their expression levels used to isolate the cell population

CD19+ CD38+ CD27+ IgM- IgD-

3 / process (cell)

Cell species
cell_species

Ontology
Ontology: { top_node: { id: NCBITAXON:7776, label: Gnathostomata}}

defined

Binomial designation of the species from which the analyzed cells originate. Typically, this value should be identical to species, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g., chimeric animal models. If set, this key will overwrite the species information for all lower layers of the schema.

id: NCBITAXON:9606, label: Homo sapiens
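
The override rule for cell_species can be sketched as a small lookup. This assumes, for illustration only, that the subject and the sample processing record are available as plain dicts. The function name effective_species and the chimeric-model example (NCBITAXON:10090, Mus musculus) are not taken from this table.

```python
def effective_species(subject, sample_processing):
    """Return the species that applies to the analyzed cells.

    cell_species, when set, overrides the subject-level species;
    otherwise the subject's species is used.
    """
    cell_species = sample_processing.get("cell_species")
    if cell_species is not None:
        return cell_species
    return subject["species"]


human = {"id": "NCBITAXON:9606", "label": "Homo sapiens"}
mouse = {"id": "NCBITAXON:10090", "label": "Mus musculus"}

# Usual case: cell_species is not set explicitly.
print(effective_species({"species": human}, {}))

# Hypothetical chimeric model: mouse subject carrying human cells.
print(effective_species({"species": mouse}, {"cell_species": human}))
```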

3 / process (cell)

Single-cell sort
single_cell

boolean
true | false

important

TRUE if single cells were isolated into separate compartments

3 / process (cell)

Number of cells in experiment
cell_number

integer
positive integer

important

Total number of cells that went into the experiment

1000000

3 / process (cell)

Number of cells per sequencing reaction
cells_per_reaction

integer
positive integer

important

Number of cells for each biological replicate

50000

3 / process (cell)

Cell storage
cell_storage

boolean
true | false

important

TRUE if cells were cryo-preserved between isolation and further processing

True

3 / process (cell)

Cell quality
cell_quality

string
free text

important

Relative amount of viable cells after preparation and (if applicable) thawing

90% viability as determined by 7-AAD

3 / process (cell)

Cell isolation / enrichment procedure
cell_isolation

string
free text

important

Description of the procedure used for marker-based isolation or enrichment of cells

Cells were stained with fluorochrome labeled antibodies and then sorted on a FlowMerlin (CE) cytometer.

3 / process (cell)

Processing protocol
cell_processing_protocol

string
free text

important

Description of the methods applied to the sample, including cell preparation/isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript.

Stimulated with anti-CD3/anti-CD28

3 / process (nucleic acid)

Target substrate
template_class

string
free text

essential

The class of nucleic acid that was used as primary starting material for the following procedures

RNA

3 / process (nucleic acid)

Target substrate quality
template_quality

string
free text

important

Description and results of the quality control performed on the template material

RIN 9.2

3 / process (nucleic acid)

Template amount
template_amount

PhysicalQuantity
**

important

Amount of template that went into the process

quantity: 1000, unit: { id: UO:0000024, label: nanogram }

3 / process (nucleic acid)

Library generation method
library_generation_method

string
free text

essential

Generic type of library generation

RT(oligo-dT)+TS(UMI)+PCR

3 / process (nucleic acid)

Library generation protocol
library_generation_protocol

string
free text

important

Description of processes applied to substrate to obtain a library that is ready for sequencing

cDNA was generated using

3 / process (nucleic acid)

Protocol IDs
library_generation_kit_version

string
free text

important

When using a library generation protocol from a commercial provider, provide the protocol version number

v2.1 (2016-09-15)

3 / process (nucleic acid)

Complete sequences
complete_sequences

string
free text

essential

To be considered complete, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered complete & untemplated, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. mixed should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value.

partial
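
The three conditions in the definition can be read as a decision rule. The following Python sketch encodes that rule under stated assumptions: the function and parameter names are hypothetical, and the value spelling "complete+untemplated" is assumed for the category the definition calls complete & untemplated.

```python
def classify_completeness(
    has_first_v_codon,
    has_last_j_codon,
    covers_all_between,
    overlaps_primers,
):
    """Classify the sequences a library construction procedure generates.

    Conditions 1) to 3) of the definition map to the first three
    arguments; overlaps_primers is True if any part of that region is
    templated by a library preparation primer.
    """
    complete = has_first_v_codon and has_last_j_codon and covers_all_between
    if not complete:
        return "partial"
    if overlaps_primers:
        return "complete"
    return "complete+untemplated"  # assumed spelling of the value


# A V-gene primer binding inside the mature coding region: not complete.
print(classify_completeness(False, True, True, True))
# Full V to J coverage with primers binding outside the region.
print(classify_completeness(True, True, True, False))
```

The sketch deliberately never returns mixed: that value describes a procedure expected to yield several categories within one experiment, so it cannot be derived from a single set of conditions.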

3 / process (nucleic acid)

Physical linkage of different rearrangements
physical_linkage

string
free text

essential

In case an experimental setup is used that physically links nucleic acids derived from distinct Rearrangements before library preparation, this field describes the mode of that linkage. All hetero_* terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. *_head-head refers to techniques that link the 5’ ends of transcripts in a single-cell context. *_tail-head refers to techniques that link the 3’ end of one transcript to the 5’ end of another one in a single-cell context. This term does not provide any information whether a continuous reading-frame between the two is generated. *_prelinked refers to constructs in which the linkage was already present on the DNA level (e.g. scFv).

hetero_head-head
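
The terms for this field follow a pairing_mode pattern, as in hetero_head-head. A minimal Python sketch of how a consumer might take such a term apart is given below. The function names are hypothetical, and the handling of a term without an underscore (for example a value meaning no linkage) is an assumption.

```python
from typing import Optional, Tuple


def parse_physical_linkage(term: str) -> Tuple[Optional[str], str]:
    """Split a term such as 'hetero_head-head' into (pairing, mode).

    Terms without an underscore are returned as (None, term); how such
    terms are defined is not covered by this sketch.
    """
    pairing, sep, mode = term.partition("_")
    if not sep:
        return None, term
    return pairing, mode


def expects_distinct_loci(term: str) -> bool:
    """True for hetero_* terms: paired reads should map to distinct IG/TR loci."""
    pairing, _ = parse_physical_linkage(term)
    return pairing == "hetero"


print(parse_physical_linkage("hetero_head-head"))
print(expects_distinct_loci("hetero_prelinked"))
```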

3 / process (nucleic acid [pcr])

Target locus for PCR
pcr_target_locus

string
free text

important

Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature.

IGK

3 / process (nucleic acid [pcr])

Forward PCR primer target location
forward_pcr_primer_target_location

string
free text

important

Position of the most distal nucleotide templated by the forward primer or primer mix

IGHV, +23

3 / process (nucleic acid [pcr])

Reverse PCR primer target location
reverse_pcr_primer_target_location

string
free text

important

Position of the most proximal nucleotide templated by the reverse primer or primer mix

IGHG, +57

3 / process (sequencing)

Batch number
sequencing_run_id

string
free text

important

ID of sequencing run assigned by the sequencing facility

160101_M01234

3 / process (sequencing)

Total reads passing QC filter
total_reads_passing_qc_filter

integer
positive integer

important

Number of usable reads for analysis

10365118

3 / process (sequencing)

Sequencing platform
sequencing_platform

string
free text

important

Designation of sequencing instrument used

Alumina LoSeq 1000

3 / process (sequencing)

Sequencing facility
sequencing_facility

string
free text

important

Name and address of sequencing facility

Seqs-R-Us, Vancouver, BC, Canada

3 / process (sequencing)

Date of sequencing run
sequencing_run_date

string
free text

important

Date of sequencing run

2016-12-16

3 / process (sequencing)

Sequencing kit
sequencing_kit

string
free text

important

Name, manufacturer, order and lot numbers of sequencing kit

FullSeq 600, Alumina, #M123456C0, 789G1HK

4 / data (raw reads)

Raw sequencing data persistent identifier
sequencing_data_id

string
free text

important

Persistent identifier of raw data stored in an archive (e.g. INSDC run ID). Data archive should be identified in the CURIE prefix.

SRA:SRR11610494

4 / data (raw reads)

Raw sequencing data file type
file_type

string
free text

important

File format for the raw reads or sequences

4 / data (raw reads)

Raw sequencing data file name
filename

string
free text

important

File name for the raw reads or sequences. The first file in paired-read sequencing.

MS10R-NMonson-C7JR9_S1_R1_001.fastq

4 / data (raw reads)

Read direction
read_direction

string
free text

important

Read direction for the raw reads or sequences. The first file in paired-read sequencing.

forward

4 / data (raw reads)

Forward read length
read_length

integer
positive integer

important

Read length in bases for the first file in paired-read sequencing

300

4 / data (raw reads)

Paired raw sequencing data file name
paired_filename

string
free text

important

File name for the second file in paired-read sequencing

MS10R-NMonson-C7JR9_S1_R2_001.fastq

4 / data (raw reads)

Paired read direction
paired_read_direction

string
free text

important

Read direction for the second file in paired-read sequencing

reverse

4 / data (raw reads)

Paired read length
paired_read_length

integer
positive integer

important

Read length in bases for the second file in paired-read sequencing

300

5 / process (computational)

Software tools and version numbers
software_versions

string
free text

important

Version number and/or date; include company pipelines

IgBLAST 1.6

5 / process (computational)

Paired read assembly
paired_reads_assembly

string
free text

important

How paired end reads were assembled into a single receptor sequence

PandaSeq (minimal overlap 50, threshold 0.8)

5 / process (computational)

Quality thresholds
quality_thresholds

string
free text

important

How/if sequences were removed from (4) based on base quality scores

Average Phred score >=20

5 / process (computational)

Primer match cutoffs
primer_match_cutoffs

string
free text

important

How primers were identified in the sequences and whether they were removed, masked, etc.

Hamming distance <= 2

5 / process (computational)

Collapsing method
collapsing_method

string
free text

important

The method used for combining multiple sequences from (4) into a single sequence in (5)

MUSCLE 3.8.31

5 / process (computational)

Data processing protocols
data_processing_protocols

string
free text

important

General description of how QC is performed

Data was processed using […]

5 / data (processed sequence)

V(D)J germline reference database
germline_database

string
free text

important

Source of germline V(D)J genes with version number or date accessed.

ENSEMBL, Homo sapiens build 90, 2017-10-01