Release Notes#

Schema Release Notes#

Version 1.5.0: August 29, 2023#

Version 1.5 schema release.

General Schema Changes:

  1. Fixed synchronization errors between the OpenAPI v2 and v3 versions of the AIRR Schema (airr-schema.yaml and airr-schema-openapi3.yaml).

  2. Set the default value of x-airr.miairr attributes to defined.

  3. Converted all x-airr.format attribute values to snake_case, which specifically impacts any instance of controlled vocabulary or physical quantity.

  4. Corrected numerous instances of missing x-airr.miairr and x-airr.identifier attributes.

  5. Replaced x-airr.adc-api-optional attribute with x-airr.adc-query-support in multiple fields.

  6. Added “IGI” as a valid value to the locus enum fields in multiple schema.

  7. Added null as a valid value to all nullable enum fields.

  8. Removed discriminator: AIRR from all object definitions.
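The snake_case conversion in item 3 can be illustrated with a small helper. This is a minimal sketch, not part of the AIRR library: the function name to_snake_case is hypothetical, and it assumes the older x-airr.format values were space-separated labels such as "controlled vocabulary".

```python
import re


def to_snake_case(value: str) -> str:
    """Lower-case a label and join its words with underscores."""
    # Split on runs of whitespace or hyphens, drop empty pieces, then rejoin.
    words = re.split(r"[\s\-]+", value.strip())
    return "_".join(word.lower() for word in words if word)


# Assumed legacy spellings, used here only for illustration.
legacy_formats = ["controlled vocabulary", "physical quantity", "ontology"]
print([to_snake_case(v) for v in legacy_formats])
# ['controlled_vocabulary', 'physical_quantity', 'ontology']
```

A helper like this may be useful when updating local tooling that still compares against the older format strings.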

Germline and Genotype Schema:

  1. Clarified the descriptions of multiple fields in the Germline and Genotype schema.

  2. Modified x-airr: nullable and x-airr: identifier values on multiple fields in the Germline and Genotype schema.

  3. Removed the alignment field and added the unaligned_sequence, aligned_sequences, and alignment_labels fields to the SequenceDelineationV object.

  4. Converted the enum values in the inference_type field of AlleleDescription to snake_case.

  5. Added the allele_similarity_cluster_designation and allele_similarity_cluster_member_id fields to AlleleDescription.

  6. Moved the nested objects DocumentedAllele, UndocumentedAllele, and DeletedGenes out of Genotype and defined them as top-level objects referenced by the documented_alleles, undocumented_alleles, and deleted_genes fields, respectively.

  7. Moved the nested object MHCAllele out of MHCGenotype and defined it as a top-level object referenced by the mhc_alleles field.

Single-cell Schema:

  1. Added the property_type field to the CellExpression object.

  2. Moved the nested ReceptorReactivity object out of Receptor and defined it as a top-level object referenced by the reactivity_measurements field.

Subject Schema:

  1. Removed the nested references to GenotypeSet and MHCGenotypeSet in the genotype field and modified the definition to point to a top-level SubjectGenotype object defining these references.

DataProcessing Schema:

  1. Clarified the description of quality_thresholds to indicate that quality filtering is not mandatory.

Version 1.4.1: August 27, 2022#

Version 1.4 schema release.

New General Purpose Schema:

  1. Introduced the experimental DataFile object, which defines a JSON file holding Repertoire metadata, data processing analysis objects, or any object in the AIRR Data Model.

  2. Introduced the experimental RepertoireGroup Schema for describing collections of repertoires to be analyzed together.

  3. Introduced the experimental InfoObject Schema, which provides information about data and ADC API responses.

  4. Introduced the experimental TimePoint Schema for defining the time point at which an observation or other action was performed.

New Germline and Genotype Schema:

The following experimental schema were introduced to support storage of VDJ germline reference sequences, VDJ genotypes, and MHC genotypes:

  1. GermlineSet: Defines a collection of AlleleDescriptions from the same strain or species.

  2. AlleleDescription: Details of a putative or confirmed Ig receptor gene/allele inferred from one or more observations.

  3. RearrangedSequence: Details of a directly observed rearranged sequence or an inference from rearranged sequences contributing support for a gene or allele.

  4. UnrearrangedSequence: Details of an unrearranged sequence contributing support for a gene or allele.

  5. SequenceDelineationV: Delineation of a V-gene in a particular system.

  6. GenotypeSet: Defines a collection of VDJ genotypes for a given subject.

  7. Genotype: Enumerates the alleles and gene deletions inferred in a single subject for a single locus.

  8. MHCGenotypeSet: Defines a collection of MHC genotypes for a given subject.

  9. MHCGenotype: Details the genotype of major histocompatibility complex (MHC) class I, class II and non-classical loci.

  10. Acknowledgement: Defines contributors to the germline or genotype description.

New Single-cell Schema:

The following experimental schema were introduced to improve support for single-cell data and extend the Cell schema:

  1. CellExpression: Defines a container to store single-cell expression level measurements.

  2. Receptor: Describes a complete receptor protein sequence and its reactivity.

Rearrangement Schema:

  1. Added the optional fields v_frameshift, j_frameshift, d_frame and d2_frame defining annotations related to alignment reading frames.

  2. Added the optional field umi_count to represent the count of distinct UMIs for a sequence.

  3. Modified the definition of duplicate_count to remove ambiguity with the new umi_count field in a single-cell context. There is now a distinction between duplicate observed sequences (duplicate_count) and UMIs (umi_count).

  4. Added the optional quality and quality_alignment fields to store Phred quality scores for base calls in the sequence and sequence_alignment fields, respectively.

  5. The following optional fields were added to denote constant region (c_call) alignment positions: c_sequence_start, c_sequence_end, c_germline_start, c_germline_end, c_alignment_start, c_alignment_end.

Study Schema:

  1. Added the optional field study_contact to store contact information for the primary study contact.

  2. Modified the enumerated values supported by keywords_study to the following set: contains_ig, contains_tr, contains_paired_chain, contains_schema_rearrangement, contains_schema_clone, contains_schema_cell, contains_schema_receptor

  3. Added the optional fields adc_publish_date and adc_update_date that timestamp AIRR Data Commons initial publication and last update, respectively.

Subject Schema:

  1. Added the optional genotype field linking to the new GenotypeSet and MHCGenotypeSet objects.

Sample Schema:

  1. Added the required field collection_time_point_relative_unit defining the units for the sample collection timestamp.

  2. Modified the type of the field collection_time_point_relative from a string to a number defined in combination with the new unit ontology field collection_time_point_relative_unit.
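Data written against earlier schema versions may need the old string value split into a number and a unit. The sketch below shows one way this might be done. It assumes a legacy value of the form "<number> <unit text>" (for example "14 d"), which is an assumption about local data rather than something the schema defines. The function name split_legacy_time_point is hypothetical, and mapping the unit text to an ontology term for collection_time_point_relative_unit is left out.

```python
import re


def split_legacy_time_point(text):
    """Split a legacy string such as '14 d' into (number, unit_text)."""
    # Optional sign, integer or decimal number, optional space, alphabetic unit.
    match = re.fullmatch(r"\s*([+-]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse time point: {text!r}")
    value, unit_text = match.groups()
    return float(value), unit_text.lower()


print(split_legacy_time_point("14 d"))
# (14.0, 'd')
```

Values that do not match the assumed pattern raise a ValueError so they can be reviewed by hand instead of being converted silently.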

NucleicAcidProcessing Schema:

  1. Added the required field template_amount_unit defining the units for the input template quantification.

  2. Modified the type of the template_amount field from a string to a number defined in combination with the new unit ontology field template_amount_unit.

Clone Schema:

  1. Added the optional clone_count field to specify absolute count of clonal members.

  2. Added the optional umi_count field to specify the total UMI count of all clonal members.

Cell Schema:

  1. Removed the field expression_tabular whose functionality has been replaced by the new CellExpression schema.

Version 1.3.1: October 13, 2020#

Version 1.3 documentation patch release.

Alignment Schema:

  1. Added the deprecation tags for rearrangement_id, which were accidentally left out of the v1.3.0 release.

Version 1.3.0: May 28, 2020#

Version 1.3 schema release.

New Schema:

  1. Introduced the Repertoire Schema for describing study meta data.

  2. Introduced the PCRTarget Schema for describing primer target locations.

  3. Introduced the SampleProcessing Schema for describing experimental processing steps for a sample.

  4. Replaced the SoftwareProcessing schema with the DataProcessing schema.

  5. Introduced experimental schema for clonal clusters, lineage trees, tree nodes, and cells as Clone, Tree, Node, and Cell objects, respectively.

General Updates:

  1. Added multiple additional attributes to a large number of schema properties as AIRR extension attributes in the x-airr field. The new Attributes object contains definitions for these x-airr field attributes.

  2. Added the top level required property to all relevant schema objects.

  3. Added the title attribute containing the short, descriptive name to all relevant schema object fields.

  4. Added an example attribute containing an example data value to multiple schema object fields.

AIRR Data Commons API:

  1. Added OpenAPI V2 specification (specs/adc-api.yaml) for AIRR Data Commons API major version 1.

Ontology Support:

  1. Added Ontology and CURIEResolution objects to support ontologies.

  2. Added vocabularies/ontologies as JSON string for: Cell subset, Target substrate, Library generation method, Complete sequences, Physical linkage of different loci.

Rearrangement Schema:

  1. Added the complete_vdj field to annotate whether a V(D)J alignment was full length.

  2. Added the junction_length_aa field defining the length of the junction amino acid sequence.

  3. Added the repertoire_id, sample_processing_id, and data_processing_id fields to serve as linkers to the appropriate metadata objects.

  4. Added a controlled vocabulary to the locus field: IGH, IGI, IGK, IGL, TRA, TRB, TRD, TRG.

  5. Deprecated the rearrangement_set_id and germline_database fields.

  6. Deprecated the rearrangement_id field and made the sequence_id field the primary unique identifier for a rearrangement record, both in files and data repositories.

  7. Added support for secondary D gene rearrangement through the additional fields: d2_call, d2_score, d2_identity, d2_support, d2_cigar, np3, np3_aa, np3_length, n3_length, p5d2_length, p3d2_length, d2_sequence_start, d2_sequence_end, d2_germline_start, d2_germline_end, d2_alignment_start, d2_alignment_end, d2_sequence_alignment, d2_sequence_alignment_aa, d2_germline_alignment, d2_germline_alignment_aa.

  8. Updated field definitions with more concise V(D)J call descriptions.

Alignment Schema:

  1. Deprecated the rearrangement_set_id and germline_database fields.

  2. Added the data_processing_id field.

Study Schema:

  1. Added the study_type field containing an ontology defined term for the study design.

Subject Schema:

  1. Deprecated the organism field in favor of the new species field.

  2. Deprecated the age field.

  3. Introduced age ranges: age_min, age_max, and age_unit.

Diagnosis Schema:

  1. Changed the type of the disease_diagnosis field from string to Ontology.

Sample Schema:

  1. Changed the type of the tissue field from string to Ontology.

CellProcessing Schema:

  1. Changed the type of the cell_subset field from string to Ontology.

  2. Introduced the cell_species field which denotes the species from which the analyzed cells originate.

NucleicAcidProcessing Schema:

  1. Defined the template_class field as type string.

  2. Added a controlled vocabulary to the library_generation_method field.

  3. Changed the controlled vocabulary terms of complete_sequences, replacing complete & untemplated with complete+untemplated and adding mixed.

  4. Added the pcr_target field referencing the new PCRTarget schema object.

SequencingRun Schema:

  1. Added the sequencing_run_id field which serves as the object identifier field.

  2. Added the sequencing_files field which links to the RawSequenceData schema objects defining the raw read data.

RawSequenceData Schema:

  1. Added the file_type field defining the sequence file type. This field is a controlled vocabulary restricted to: fasta, fastq.

  2. Added the paired_read_length field defining mate-pair read lengths.

  3. Defined the read_direction and paired_read_direction fields as type string.

DataProcessing Schema:

  1. Replaces the SoftwareProcessing object.

  2. Added data_processing_id, primary_annotation, data_processing_files, germline_database and analysis_provenance_id fields.

Version 1.2.1: Oct 5, 2018#

Minor patch release.

  1. Schema gene vs segment terminology corrections

  2. Added Info object

  3. Updated cell_subset URL in AIRR schema

Version 1.2.0: Aug 18, 2018#

Peer-reviewed release of the Rearrangement schema.

  1. Definition change for the coordinate fields of the Rearrangement and Alignment schema. Coordinates are now defined as 1-based closed intervals, instead of 0-based half-open intervals (as previously defined in v1.1 of the schema).

  2. Removed foreign study_id fields

  3. Introduced keywords_study field
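The coordinate change in item 1 affects how positions map onto Python slices. The following is a minimal sketch of the arithmetic only; the function names are hypothetical and not part of any AIRR library.

```python
from typing import Tuple


def closed_1_based_to_half_open_0_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based closed interval to a 0-based half-open interval."""
    # Only the start shifts; the closed end equals the half-open end.
    return start - 1, end


def half_open_0_based_to_closed_1_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open interval to a 1-based closed interval."""
    return start + 1, end


sequence = "ACGTACGT"
# Positions 3 through 5 in 1-based closed coordinates cover "GTA".
start, end = closed_1_based_to_half_open_0_based(3, 5)
print(sequence[start:end])
# GTA
```

Both conventions describe the same bases, and the interval length is end - start + 1 in the 1-based closed form and end - start in the 0-based half-open form.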

Version 1.1.0: May 3, 2018#

Initial public release of the Rearrangement and Alignment schemas.

  1. Added required and nullable constraints to AIRR schema.

  2. Schema definitions for MiAIRR attributes and ontology.

  3. Introduction of an x-airr object indicating whether a field is required by MiAIRR.

  4. Rename rearrangement_set_id to data_processing_id.

  5. Rename study_description to study_type.

  6. Added physical_quantity format.

  7. Moved raw sequencing files into a separate schema object.

  8. Rename Attributes object.

  9. Added primary_annotation and repertoire_id.

  10. Added diagnosis to repertoire object.

  11. Added ontology for organism.

  12. Added more detailed specification of sequencing_run, repertoire and rearrangement.

  13. Added repertoire schema.

  14. Rename definitions.yaml to airr-schema.yaml.

  15. Removed c_call, c_score and c_cigar from required as this is not typical reference aligner output.

  16. Renamed vdj_score, vdj_identity, vdj_evalue, and vdj_cigar to score, identity, evalue, and cigar.

  17. Added missing c_identity and c_evalue fields to Rearrangement spec.

  18. Swapped order of N and S operators in CIGAR string.

  19. Some description clean up for consistency in Rearrangement spec.

  20. Remove repeated objects in definitions.yaml.

  21. Added Alignment object to definitions.yaml.

  22. Updated MiAIRR format consistency check TSV with junction change.

  23. Changed definition from functional to productive.

Version 1.0.1: Jan 9, 2018#

MiAIRR v1 official release and initial draft of Rearrangement and Alignment schemas.

Python Library Release Notes#

Version 1.5.0: August 29, 2023#

  1. Updated schema set and examples to v1.5.

  2. Officially dropped support for Python 2.

  3. Added check for valid enum values to schema validation routines.

  4. Set enum values to first defined value during template generation routines.

  5. Removed mock dependency installation in ReadTheDocs environments from setup.

  6. Improved package import time.
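The enum check in item 3, combined with the v1.5 schema change that allows null in nullable enum fields, can be pictured roughly as follows. This is a simplified sketch and not the library's implementation: the function name is_valid_enum is hypothetical, and the locus values are taken from the v1.3.0 controlled vocabulary listed in the schema notes.

```python
# Locus vocabulary as listed in the v1.3.0 schema release notes.
LOCUS_VALUES = ["IGH", "IGI", "IGK", "IGL", "TRA", "TRB", "TRD", "TRG"]


def is_valid_enum(value, allowed, nullable=False):
    """Return True if value is permitted for an enum field."""
    # A missing value is acceptable only when the field is nullable.
    if value is None:
        return nullable
    return value in allowed


print(is_valid_enum("IGH", LOCUS_VALUES))                # True
print(is_valid_enum("IGX", LOCUS_VALUES))                # False
print(is_valid_enum(None, LOCUS_VALUES, nullable=True))  # True
```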

Version 1.4.1: August 27, 2022#

General:

  1. Updated pandas requirement to 0.24.0 or higher.

  2. Added support for missing integer values (NaN) in load_rearrangement by casting to the pandas Int64 data type.

  3. Added gzip support to read_rearrangement.

  4. Significant internal refactoring to improve schema generalizability, harmonize behavior between the python and R libraries, and prepare for AIRR Standards v2.0.

  5. Fixed a bug in the validate subcommand of airr-tools causing validation errors to be reported only for the first invalid file when multiple files were specified on the command line.
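For readers unfamiliar with the idea behind item 3, transparent gzip reading of a TSV file can be sketched with the standard library alone. This illustrates the general technique and is not how read_rearrangement is implemented; the helper names open_tsv and read_rows are hypothetical, and detecting compression from a ".gz" suffix is an assumption of this sketch.

```python
import csv
import gzip


def open_tsv(path):
    """Open a TSV file as text, decompressing when the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_rows(path):
    """Yield each record of a tab-delimited file as a dictionary."""
    with open_tsv(path) as handle:
        yield from csv.DictReader(handle, delimiter="\t")
```

All values come back as strings here; type conversion according to the schema is a separate step that this sketch does not attempt.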

Data Model and Schema:

  1. Added support for arrays of objects in a single JSON or YAML file.

  2. Added support for the AIRR Data File and associated schema (DataFile, Info). The Data File data format holds AIRR objects of multiple types and is backwards compatible with Repertoire metadata.

  3. Added support for the new germline and genotyping schema (GermlineSet, GenotypeSet) and associated schema.

  4. Renamed schema.CachedSchema to schema.AIRRSchema.

  5. Removed specs/blank.airr.yaml.

Deprecations:

  1. Deprecated load_repertoire. Use read_airr instead.

  2. Deprecated write_repertoire. Use write_airr instead.

  3. Deprecated validate_repertoire. Use validate_airr instead.

  4. Deprecated repertoire_template. Use schema.RepertoireSchema.template instead.

  5. Deprecated the commandline tool airr-tools validate repertoire. Use airr-tools validate airr instead.

Version 1.3.1: October 13, 2020#

  1. Refactored merge_rearrangement to allow for larger number of files.

  2. Improved error handling in format validation operations.

Version 1.3.0: May 30, 2020#

  1. Updated schema set to v1.3.

  2. Added load_repertoire, write_repertoire, and validate_repertoire to airr.interface to read, write and validate Repertoire metadata, respectively.

  3. Added repertoire_template to airr.interface which will return a complete repertoire object where all fields have null values.

  4. Added validate_object to airr.schema that will validate a single repertoire object against the schema.

  5. Extended the airr-tools commandline program to validate both rearrangement and repertoire files.

Version 1.2.1: October 5, 2018#

  1. Fixed a bug in the python reference library causing start coordinate values to be empty in some cases when writing data.

Version 1.2.0: August 17, 2018#

  1. Updated schema set to v1.2.

  2. Several improvements to the validate_rearrangement function.

  3. Changed behavior of all airr.interface functions to accept a file path (string) to a single Rearrangement TSV, instead of requiring a file handle as input.

  4. Added base argument to RearrangementReader and RearrangementWriter to support optional conversion of 1-based closed intervals in the TSV to python-style 0-based half-open intervals. Defaults to conversion.

  5. Added the custom exception ValidationError for handling validation checks.

  6. Added the validate argument to RearrangementReader which will raise a ValidationError exception when reading files with missing required fields or invalid values for known field types.

  7. Added the validate argument to all type conversion methods in Schema, which will now raise a ValidationError exception for a value that cannot be converted when set to True. When set to False (default), the previous behavior of assigning None as the converted value is retained.

  8. Added validate_header and validate_row methods to Schema and removed validation methods from RearrangementReader.

  9. Removed automatic closure of file handle upon reaching the iterator end in RearrangementReader.
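The validate behavior described in item 7 can be pictured with a small stand-alone function. This is a sketch of the described semantics only: the ValidationError class below is a local stand-in so that the example is self-contained, the function name to_int is hypothetical, and treating an empty string as a missing value is an assumption.

```python
class ValidationError(Exception):
    """Local stand-in for the library's exception of the same name."""


def to_int(value, validate=False):
    """Convert value to int; on failure return None or raise, per validate."""
    # Assumption of this sketch: empty strings and None mean "missing".
    if value is None or value == "":
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        if validate:
            raise ValidationError(f"invalid integer value: {value!r}")
        # Default behavior: an unconvertible value becomes None.
        return None


print(to_int("42"))   # 42
print(to_int("abc"))  # None
```

Calling to_int("abc", validate=True) raises ValidationError instead of returning None, which mirrors the strict mode described above.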

Version 1.1.0: May 1, 2018#

Initial release.

R Library Release Notes#

Version 1.5.0: August 29, 2023#

  • Updated schema set and examples to v1.5.

Version 1.4.1: August 27, 2022#

Significant internal refactoring to improve schema generalizability, harmonize behavior between the python and R libraries, and prepare for AIRR Standards v2.0.

Rearrangement:

  • Added the aux_types argument to read_tabular, read_rearrangement, and read_alignment to allow explicit declaration of the type for fields that are not defined in the schema.

  • Renamed read_airr, write_airr, and validate_airr to read_tabular, write_tabular, and validate_tabular, respectively.

Data Model and Schema:

  • Defined new read_airr, write_airr, and validate_airr functions that support AIRR Data Model files that store arrays of objects in JSON or YAML.

  • Added support for the AIRR Data File and associated schema (DataFile, Info). The Data File data format holds AIRR objects of multiple types and is backwards compatible with Repertoire metadata.

  • Added support for the new germline and genotyping schema (GermlineSet, GenotypeSet) and associated schema.

Version 1.3.0: May 26, 2020#

  • Updated schema set to v1.3.

  • Added info slot to Schema object containing general schema information.

Version 1.2.0: August 17, 2018#

  • Updated schema set to v1.2.

  • Changed defaults to base="1" for read and write functions.

  • Updated example TSV file with coordinate changes, addition of germline_alignment data and simplification of sequence_id values.

Version 1.1.0: May 1, 2018#

Initial release.