MiAIRR Data Elements

The AIRR Community has agreed to six high-level data sets that will guide the publication, curation and sharing of AIRR-Seq data and metadata: Study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences.


Set / Subset	Designation / Field	Type / Format	Level	Definition	Example
1 / study	Study ID `study_id`	string free text	important	Unique ID assigned by study registry such as one of the International Nucleotide Sequence Database Collaboration (INSDC) repositories.	PRJNA001
1 / study	Study title `study_title`	string free text	important	Descriptive study title	Effects of sunlight exposure on the Treg repertoire
1 / study	Study type `study_type`	Ontology Ontology: { top_node: { id: NCIT:C63536, label: Study}}	important	Type of study design	id: NCIT:C15197, label: Case-Control Study
1 / study	Study inclusion/exclusion criteria `inclusion_exclusion_criteria`	string free text	important	List of criteria for inclusion/exclusion for the study	Include: Clinical P. falciparum infection; Exclude: Seropositive for HIV
1 / study	Grant funding agency `grants`	string free text	important	Funding agencies and grant numbers	NIH, award number R01GM987654
1 / study	Contributors `contributors`	array of Contributor **	essential	List of individuals who contributed to the study. Note that these are not necessarily identical with the authors on an associated manuscript or other scholarly communication. Further note that typically at least the three CRediT contributor roles “supervision”, “investigation” and “data curation” should be assigned. The corresponding author should be listed last.
1 / study	Relevant publications `pub_ids`	array of string **	important	Array of publications describing the rationale and/or outcome of the study as an array of CURIE objects such as a DOI or Pubmed ID. Where more than one publication is given, if there is a primary publication for the study it should come first.	[‘PMID:29144493’, ‘DOI:10.1038/ni.3873’]
1 / study	Keywords for study `keywords_study`	array of string **	important	Keywords describing properties of one or more data sets in a study. “contains_schema” keywords indicate that the study contains data objects from the AIRR Schema of that type (Rearrangement, Clone, Cell, Receptor) while the other keywords indicate that the study design considers the type of data indicated (e.g. it is possible to have a study that “contains_paired_chain” but does not “contains_schema_cell”).	[‘contains_ig’, ‘contains_schema_rearrangement’, ‘contains_schema_clone’, ‘contains_schema_cell’]
1 / subject	Subject ID `subject_id`	string free text	important	Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used.	SUB856413
1 / subject	Synthetic library `synthetic`	boolean true | false	essential	TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display)
1 / subject	Organism `species`	Ontology Ontology: { top_node: { id: NCBITAXON:7776, label: Gnathostomata}}	essential	Binomial designation of subject’s species	id: NCBITAXON:9606, label: Homo sapiens
1 / subject	Sex `sex`	string free text	important	Biological sex of subject	female
1 / subject	`age`	TimeInterval **	important	Age of subject expressed as a time interval. If singular time point then min == max in the time interval.
1 / subject	Age event `age_event`	string free text	important	Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time, hence the chosen events should be in temporal proximity.	enrollment
1 / subject	Ancestry population `ancestry_population`	Ontology Ontology: { top_node: { id: GAZ:00000448, label: geographic location}}	important	Broad geographic origin of ancestry (continent)	id: GAZ:00000459, label: South America
1 / subject	`location_birth`	Ontology Ontology: { top_node: { id: GAZ:00000448, label: geographic location}}	important	Self-reported location of birth of the subject, preferred granularity is country-level	id: GAZ:00002939, label: Poland
1 / subject	Ethnicity `ethnicity`	string free text	important	Ethnic group of subject (defined as cultural/language-based membership)	English, Kurds, Manchu, Yakuts (and other fields from Wikipedia)
1 / subject	Race `race`	string free text	important	Racial group of subject (as defined by NIH)	White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other
1 / subject	Strain name `strain_name`	string free text	important	Non-human designation of the strain or breed of animal used	C57BL/6J
1 / subject	Relation to other subjects `linked_subjects`	string free text	important	Subject ID to which Relation type refers	SUB1355648
1 / subject	Relation type `link_type`	string free text	important	Relation between subject and linked_subjects, can be genetic or environmental (e.g. exposure)	father, daughter, household
1 / diagnosis and intervention	Study group description `study_group_description`	string free text	important	Designation of study arm to which the subject is assigned	control
1 / diagnosis and intervention	Diagnosis timepoint `diagnosis_timepoint`	TimePoint **	important	Time point for the diagnosis	label: Study enrollment, value: 60, unit: {id: UO:0000033, label: day}
1 / diagnosis and intervention	Diagnosis `disease_diagnosis`	Ontology Ontology: { top_node: { id: DOID:4, label: disease}}	important	Diagnosis of subject	id: DOID:9538, label: multiple myeloma
1 / diagnosis and intervention	Length of disease `disease_length`	TimeQuantity **	important	Time duration between initial diagnosis and current intervention	quantity: 23, unit: {id: UO:0000035, label: month}
1 / diagnosis and intervention	Disease stage `disease_stage`	string free text	important	Stage of disease at current intervention	Stage II
1 / diagnosis and intervention	Prior therapies for primary disease under study `prior_therapies`	string free text	important	List of all relevant previous therapies applied to subject for treatment of Diagnosis	melphalan/prednisone
1 / diagnosis and intervention	Immunogen/agent `immunogen`	string free text	important	Antigen, vaccine or drug applied to subject at this intervention	bortezomib
1 / diagnosis and intervention	Intervention definition `intervention`	string free text	important	Description of intervention	systemic chemotherapy, 6 cycles, 1.25 mg/m2
1 / diagnosis and intervention	Other relevant medical history `medical_history`	string free text	important	Medical history of subject that is relevant to assess the course of disease and/or treatment	MGUS, first diagnosed 5 years prior
2 / sample	Biological sample ID `sample_id`	string free text	important	Sample ID assigned by submitter, unique within study. If possible, a persistent sample ID linked to INSDC or similar repository study should be used.	SUP52415
2 / sample	Sample type `sample_type`	string free text	important	The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture	Biopsy
2 / sample	Tissue `tissue`	Ontology Ontology: { top_node: { id: UBERON:0010000, label: multicellular anatomical structure}}	important	The actual tissue sampled, e.g. lymph node, liver, peripheral blood	id: UBERON:0002371, label: bone marrow
2 / sample	Anatomic site `anatomic_site`	string free text	important	The anatomic location of the tissue, e.g. Inguinal, femur	Iliac crest
2 / sample	Disease state of sample `disease_state_sample`	string free text	important	Histopathologic evaluation of the sample	Tumor infiltration
2 / sample	Sample collection time `collection_time_point_relative`	TimePoint **	important	Time point at which sample was taken, relative to label event	label: Primary vaccination, value: 14, unit: {id: UO:0000033, label: day}
2 / sample	Sample collection location `collection_location`	Ontology Ontology: { top_node: { id: GAZ:00000448, label: geographic location}}	important	Location where the sample was taken, preferred granularity is country-level	id: GAZ:00002939, label: Poland
2 / sample	Biomaterial provider `biomaterial_provider`	string free text	important	Name and address of the entity providing the sample	Tissues-R-Us, Tampa, FL, USA
3 / process (cell)	Tissue processing `tissue_processing`	string free text	important	Enzymatic digestion and/or physical methods used to isolate cells from sample	Collagenase A/DNase I digested, followed by Percoll gradient
3 / process (cell)	Cell subset `cell_subset`	Ontology Ontology: { top_node: { id: CL:0000542, label: lymphocyte}}	important	Commonly-used designation of isolated cell population	id: CL:0000972, label: class switched memory B cell
3 / process (cell)	Cell subset phenotype `cell_phenotype`	string free text	important	List of cellular markers and their expression levels used to isolate the cell population	CD19+ CD38+ CD27+ IgM- IgD-
3 / process (cell)	Cell species `cell_species`	Ontology Ontology: { top_node: { id: NCBITAXON:7776, label: Gnathostomata}}	defined	Binomial designation of the species from which the analyzed cells originate. Typically, this value should be identical to species, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g., chimeric animal models. If set, this key will overwrite the species information for all lower layers of the schema.	id: NCBITAXON:9606, label: Homo sapiens
3 / process (cell)	Single-cell sort `single_cell`	boolean true | false	important	TRUE if single cells were isolated into separate compartments
3 / process (cell)	Number of cells in experiment `cell_number`	integer positive integer	important	Total number of cells that went into the experiment	1000000
3 / process (cell)	Number of cells per sequencing reaction `cells_per_reaction`	integer positive integer	important	Number of cells for each biological replicate	50000
3 / process (cell)	Cell storage `cell_storage`	boolean true | false	important	TRUE if cells were cryo-preserved between isolation and further processing	True
3 / process (cell)	Cell quality `cell_quality`	string free text	important	Relative amount of viable cells after preparation and (if applicable) thawing	90% viability as determined by 7-AAD
3 / process (cell)	Cell isolation / enrichment procedure `cell_isolation`	string free text	important	Description of the procedure used for marker-based isolation or enrichment of cells	Cells were stained with fluorochrome labeled antibodies and then sorted on a FlowMerlin (CE) cytometer.
3 / process (cell)	Processing protocol `cell_processing_protocol`	string free text	important	Description of the methods applied to the sample including cell preparation/ isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript.	Stimulated with anti-CD3/anti-CD28
3 / process (nucleic acid)	Target substrate `template_class`	string free text	essential	The class of nucleic acid that was used as primary starting material for the following procedures	RNA
3 / process (nucleic acid)	Target substrate quality `template_quality`	string free text	important	Description and results of the quality control performed on the template material	RIN 9.2
3 / process (nucleic acid)	Template amount `template_amount`	PhysicalQuantity **	important	Amount of template that went into the process	quantity: 1000, unit: {id: UO:0000024, label: nanogram}
3 / process (nucleic acid)	Library generation method `library_generation_method`	string free text	essential	Generic type of library generation	RT(oligo-dT)+TS(UMI)+PCR
3 / process (nucleic acid)	Library generation protocol `library_generation_protocol`	string free text	important	Description of processes applied to substrate to obtain a library that is ready for sequencing	cDNA was generated using
3 / process (nucleic acid)	Protocol IDs `library_generation_kit_version`	string free text	important	When using a library generation protocol from a commercial provider, provide the protocol version number	v2.1 (2016-09-15)
3 / process (nucleic acid)	Complete sequences `complete_sequences`	string free text	essential	To be considered complete, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered complete & untemplated, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. mixed should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value.	partial
3 / process (nucleic acid)	Physical linkage of different rearrangements `physical_linkage`	string free text	essential	In case an experimental setup is used that physically links nucleic acids derived from distinct Rearrangements before library preparation, this field describes the mode of that linkage. All hetero_* terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. *_head-head refers to techniques that link the 5’ ends of transcripts in a single-cell context. *_tail-head refers to techniques that link the 3’ end of one transcript to the 5’ end of another one in a single-cell context. This term does not provide any information on whether a continuous reading frame between the two is generated. *_prelinked refers to constructs in which the linkage was already present on the DNA level (e.g. scFv).	hetero_head-head
3 / process (nucleic acid [pcr])	Target locus for PCR `pcr_target_locus`	string free text	important	Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature.	IGK
3 / process (nucleic acid [pcr])	Forward PCR primer target location `forward_pcr_primer_target_location`	string free text	important	Position of the most distal nucleotide templated by the forward primer or primer mix	IGHV, +23
3 / process (nucleic acid [pcr])	Reverse PCR primer target location `reverse_pcr_primer_target_location`	string free text	important	Position of the most proximal nucleotide templated by the reverse primer or primer mix	IGHG, +57
3 / process (sequencing)	Batch number `sequencing_run_id`	string free text	important	ID of sequencing run assigned by the sequencing facility	160101_M01234
3 / process (sequencing)	Total reads passing QC filter `total_reads_passing_qc_filter`	integer positive integer	important	Number of usable reads for analysis	10365118
3 / process (sequencing)	Sequencing platform `sequencing_platform`	string free text	important	Designation of sequencing instrument used	Alumina LoSeq 1000
3 / process (sequencing)	Sequencing facility `sequencing_facility`	string free text	important	Name and address of sequencing facility	Seqs-R-Us, Vancouver, BC, Canada
3 / process (sequencing)	Date of sequencing run `sequencing_run_date`	string free text	important	Date of sequencing run	2016-12-16
3 / process (sequencing)	Sequencing kit `sequencing_kit`	string free text	important	Name, manufacturer, order and lot numbers of sequencing kit	FullSeq 600, Alumina, #M123456C0, 789G1HK
4 / data (raw reads)	Raw sequencing data persistent identifier `sequencing_data_id`	string free text	important	Persistent identifier of raw data stored in an archive (e.g. INSDC run ID). Data archive should be identified in the CURIE prefix.	SRA:SRR11610494
4 / data (raw reads)	Raw sequencing data file type `file_type`	string free text	important	File format for the raw reads or sequences
4 / data (raw reads)	Raw sequencing data file name `filename`	string free text	important	File name for the raw reads or sequences. The first file in paired-read sequencing.	MS10R-NMonson-C7JR9_S1_R1_001.fastq
4 / data (raw reads)	Read direction `read_direction`	string free text	important	Read direction for the raw reads or sequences. The first file in paired-read sequencing.	forward
4 / data (raw reads)	Forward read length `read_length`	integer positive integer	important	Read length in bases for the first file in paired-read sequencing	300
4 / data (raw reads)	Paired raw sequencing data file name `paired_filename`	string free text	important	File name for the second file in paired-read sequencing	MS10R-NMonson-C7JR9_S1_R2_001.fastq
4 / data (raw reads)	Paired read direction `paired_read_direction`	string free text	important	Read direction for the second file in paired-read sequencing	reverse
4 / data (raw reads)	Paired read length `paired_read_length`	integer positive integer	important	Read length in bases for the second file in paired-read sequencing	300
5 / process (computational)	Software tools and version numbers `software_versions`	string free text	important	Version number and / or date, include company pipelines	IgBLAST 1.6
5 / process (computational)	Paired read assembly `paired_reads_assembly`	string free text	important	How paired end reads were assembled into a single receptor sequence	PandaSeq (minimal overlap 50, threshold 0.8)
5 / process (computational)	Quality thresholds `quality_thresholds`	string free text	important	How/if sequences were removed from (4) based on base quality scores	Average Phred score >=20
5 / process (computational)	Primer match cutoffs `primer_match_cutoffs`	string free text	important	How primers were identified in the sequences, were they removed/masked/etc?	Hamming distance <= 2
5 / process (computational)	Collapsing method `collapsing_method`	string free text	important	The method used for combining multiple sequences from (4) into a single sequence in (5)	MUSCLE 3.8.31
5 / process (computational)	Data processing protocols `data_processing_protocols`	string free text	important	General description of how QC is performed	Data was processed using […]
5 / data (processed sequence)	V(D)J germline reference database `germline_database`	string free text	important	Source of germline V(D)J genes with version number or date accessed.	ENSEMBL, Homo sapiens build 90, 2017-10-01
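
The Level column in the table above distinguishes essential fields from important ones. As a rough illustration of how a submitter might check a record against that column, the following Python sketch flags essential fields that are missing. It is a minimal sketch under stated assumptions: the flat dictionary layout, the `missing_essential` helper and the `ESSENTIAL_FIELDS` tuple are hypothetical names introduced here, not part of any AIRR library, and the actual AIRR Schema may nest these fields differently. Field names and example values are taken from the table; the `synthetic` value is an assumption, since the table gives no example for it.

```python
# Field names whose Level is "essential" in the table above.
ESSENTIAL_FIELDS = (
    "contributors",
    "synthetic",
    "species",
    "template_class",
    "library_generation_method",
    "complete_sequences",
    "physical_linkage",
)


def missing_essential(record):
    """Return the essential field names that are absent or None in `record`.

    Hypothetical helper for illustration only. A value of False counts as
    present, because `synthetic` is a boolean field.
    """
    return [name for name in ESSENTIAL_FIELDS if record.get(name) is None]


# Hypothetical flat record. Values come from the Example column where the
# table provides one; `synthetic` is an assumed value.
record = {
    "study_id": "PRJNA001",
    "synthetic": False,
    "species": {"id": "NCBITAXON:9606", "label": "Homo sapiens"},
    "template_class": "RNA",
    "library_generation_method": "RT(oligo-dT)+TS(UMI)+PCR",
    "complete_sequences": "partial",
    "physical_linkage": "hetero_head-head",
    "collection_time_point_relative": {
        "label": "Primary vaccination",
        "value": 14,
        "unit": {"id": "UO:0000033", "label": "day"},
    },
}

print(missing_essential(record))  # prints ['contributors']
```

Treating only `None` or an absent key as missing, rather than any falsy value, keeps a legitimate `synthetic: False` from being reported as a gap.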